from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Mounted at /content/drive
!jupyter nbconvert --to markdown '/content/drive/MyDrive/Colab Notebooks/El Nino DataAnalysisProjectNotebook.ipynb'
[NbConvertApp] Converting notebook /content/drive/MyDrive/Colab Notebooks/El Nino DataAnalysisProjectNotebook.ipynb to markdown
[NbConvertApp] Support files will be in El Nino DataAnalysisProjectNotebook_files/
[NbConvertApp] Writing 4515983 bytes to /content/drive/MyDrive/Colab Notebooks/El Nino DataAnalysisProjectNotebook.md
Week 1: Establish your Group and Identify Group Roles¶
Hello Scholars and welcome to the Howard University Applied Data Science and Analytics Bootcamp. As part of your first week activities you have been assigned to a group to collaborate on a data analytics project. The goal is to demonstrate how the skills you are learning contribute to the analysis of data and to determine actionable insights. As you work through the document, there are comments to the right to help guide you. When they are completed you should click the blue checkmark to resolve them.
In the first week of your group project you will identify the role of each of your group members, develop a research question, and identify a dataset you will use to answer the research question.
Observations of the Data Set¶
This data was collected with the Tropical Atmosphere Ocean (TAO) array which was developed by the international Tropical Ocean Global Atmosphere (TOGA) program. The TAO array consists of nearly 70 moored buoys spanning the equatorial Pacific, measuring oceanographic and surface meteorological variables critical for improved detection, understanding and prediction of seasonal-to-interannual climate variations originating in the tropics, most notably those related to the El Nino/Southern Oscillation (ENSO) cycles.
The moorings were developed by the National Oceanic and Atmospheric Administration's (NOAA) Pacific Marine Environmental Laboratory (PMEL). Each mooring measures air temperature, relative humidity, surface winds, sea surface temperatures and subsurface temperatures down to a depth of 500 meters, and a few of the buoys measure currents, rainfall and solar radiation. The data from the array, and current updates, can be viewed on the web at this address .
The El Nino/Southern Oscillation (ENSO) cycle of 1982-1983, the strongest of the century, created many problems throughout the world. Parts of the world such as Peru and the United States experienced destructive flooding from increased rainfall, while the western Pacific areas experienced drought and devastating brush fires. The ENSO cycle was neither predicted nor detected until it was near its peak. This highlighted the need for an ocean observing system (i.e. the TAO array) to support studies of large scale ocean-atmosphere interactions on seasonal-to-interannual time scales.
The TAO array provides real-time data to climate researchers, weather prediction centers and scientists around the world. Forecasts for tropical Pacific Ocean temperatures for one to two years in advance can be made using the ENSO cycle data. These forecasts are possible because of the moored buoys, along with drifting buoys, volunteer ship temperature probes, and sea level measurements.
Research questions of interest include:
- How can the data be used to predict weather conditions throughout the world?
- How do the variables relate to each other?
- Which variables have a greater effect on the climate variations?
- Does the amount of movement of the buoy affect the reliability of the data?
- When performing an analysis of the data, one should pay attention to the possible effect of autocorrelation. Using a multiple regression approach to model the data would require a look at autoregression, since the weather statistics of the previous days will affect today's weather.
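The autocorrelation point in the last bullet can be illustrated with a short sketch. It uses a synthetic daily temperature series (not TAO data) in which each day depends partly on the previous day. The coefficient 0.8, the noise level, and the variable names are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperature series (illustrative only, not TAO data):
# each day's value depends partly on the previous day's value.
rng = np.random.default_rng(0)
n_days = 365
noise = rng.normal(0.0, 0.3, n_days)
temps = np.empty(n_days)
temps[0] = 26.0
for t in range(1, n_days):
    temps[t] = 26.0 + 0.8 * (temps[t - 1] - 26.0) + noise[t]

series = pd.Series(temps, name="air_temp")

# Lag-1 autocorrelation: correlation between each day's value and the previous day's.
lag1 = series.autocorr(lag=1)

# A lagged copy of the series is the kind of regressor an autoregressive model would use.
lagged = pd.DataFrame({"today": series, "yesterday": series.shift(1)}).dropna()

print(f"lag-1 autocorrelation: {lag1:.2f}")
```

A lag-1 autocorrelation well above zero suggests that consecutive observations are not independent, which is the situation the bullet warns about for ordinary multiple regression.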
Spaces separate fields and periods (.) denote missing values.
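As a rough sketch of what this layout means for loading, the example below parses a few illustrative lines with pandas. The sample lines and the column order are assumptions modelled on the variable list in this notebook, not taken from the raw file.

```python
import io
import pandas as pd

# Illustrative lines in the described layout: space-separated fields,
# with a lone period marking a missing value. The column order is assumed.
raw = """\
1 80 3 7 800307 -0.02 -109.46 -6.8 0.7 . 26.14 26.24
2 80 3 8 800308 -0.02 -109.46 -4.9 1.1 . 25.66 25.97
3 80 3 9 800309 -0.02 -109.46 . . . 25.69 .
"""

columns = ["obs", "year", "month", "day", "date", "latitude", "longitude",
           "zon_winds", "mer_winds", "humidity", "air_temp", "ss_temp"]

sample = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",      # one or more whitespace characters between fields
    header=None,     # the lines carry no header row
    names=columns,
    na_values=".",   # treat a lone period as a missing value
    index_col="obs",
)

print(sample.isnull().sum())
```

Because `na_values` matches whole fields only, decimal numbers such as `0.7` are still read as floats.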
Week 2: The Data Science Process¶
The data science process consists of seven key steps. Each week we will complete another step of the process. The work we complete in class each week will structure the work you will complete on the group project. This week you will work with your dataset file, load the data into the Python environment, and make a plan for the data cleaning steps.
Data Collection:¶
Data collection begins with identifying a reliable and accurate data source and using tools to download the dataset for examination. The drive is mounted in Colab, creating a link to the data source. Next, the necessary libraries are imported; these contain pre-written code that performs specific tasks. Python has several libraries that are powerful tools for data analysis and visualization.
Once the dataset is loaded and the libraries are imported, the dataset can be read and the dataframe created. The data is then checked and the data cleaning process begins.
Examine the dataset¶
Describe Methodology.
The data is available from the UC Irvine Machine Learning Repository and can be imported directly from the repository to get the most current version. This requires a pip install of the ucimlrepo package and an import of its fetch function. The data description indicates that the data consists of raw scientific measurements from remote buoy stations placed throughout the Pacific Ocean. The data is likely incomplete, with values missing for some of the measurements. The data goes back to the 1980s, and some of the remote measurement technology of that time would not have been hardened against or resilient to the roughness of the Pacific Ocean, so values may not be available for some fields.
Import libraries¶
# Import the libraries
import numpy as np # Scientific Computing
import pandas as pd # Data Analysis
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Statistical Data Visualization
# Let's make sure pandas returns all the rows and columns for the dataframe
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Force pandas to display full numbers instead of scientific notation
# pd.options.display.float_format = '{:.0f}'.format
# Library to suppress warnings
import warnings
warnings.filterwarnings('ignore')
Install the UCI Repo Connection Library¶
pip install ucimlrepo
Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Requirement already satisfied: pandas>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ucimlrepo) (2.0.3)
Requirement already satisfied: certifi>=2020.12.5 in /usr/local/lib/python3.10/dist-packages (from ucimlrepo) (2024.6.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2023.4)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2024.1)
Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (1.25.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.0.0->ucimlrepo) (1.16.0)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
Read the dataset¶
# Import the UCI Repo Connection Library
from ucimlrepo import fetch_ucirepo
# fetch dataset
el_nino = fetch_ucirepo(id=122)
# data (as pandas dataframes)
remoteStations = el_nino.data.features
# metadata
print(el_nino.metadata)
# variable information
print(el_nino.variables)
{'uci_id': 122, 'name': 'El Nino', 'repository_url': 'https://archive.ics.uci.edu/dataset/122/el+nino', 'data_url': 'https://archive.ics.uci.edu/static/public/122/data.csv', 'abstract': 'The data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific.', 'area': 'Climate and Environment', 'tasks': ['Other'], 'characteristics': ['Other'], 'num_instances': 178080, 'num_features': 11, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': None, 'index_col': ['obs'], 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1999, 'last_updated': 'Sat Mar 16 2024', 'dataset_doi': '10.24432/C5WG62', 'creators': [], 'intro_paper': None, 'additional_info': {'summary': "This data was collected with the Tropical Atmosphere Ocean (TAO) array which was developed by the international Tropical Ocean Global Atmosphere (TOGA) program. The TAO array consists of nearly 70 moored buoys spanning the equatorial Pacific, measuring oceanographic and surface meteorological variables critical for improved detection, understanding and prediction of seasonal-to-interannual climate variations originating in the tropics, most notably those related to the El Nino/Southern Oscillation (ENSO) cycles. \r\n\r\nThe moorings were developed by National Oceanic and Atmospheric Administration's (NOAA) Pacific Marine Environmental Laboratory (PMEL). Each mooring measures air temperature, relative humidity, surface winds, sea surface temperatures and subsurface temperatures down to a depth of 500 meters and a few a of the buoys measure currents, rainfall and solar radiation. The data from the array, and current updates, can be viewed on the web at the this address . \r\n\r\nThe El Nino/Southern Oscillation (ENSO) cycle of 1982-1983, the strongest of the century, created many problems throughout the world. 
Parts of the world such as Peru and the Unites States experienced destructive flooding from increased rainfalls while the western Pacific areas experienced drought and devastating brush fires. The ENSO cycle was neither predicted nor detected until it was near its peak. This highlighted the need for an ocean observing system (i.e. the TAO array) to support studies of large scale ocean-atmosphere interactions on seasonal-to-interannual time scales. \r\n\r\nThe TAO array provides real-time data to climate researchers, weather prediction centers and scientists around the world. Forcasts for tropical Pacific Ocean temperatures for one to two years in advance can be made using the ENSO cycle data. These forcasts are possible because of the moored buoys, along with drifting buoys, volunteer ship temperature probes, and sea level measurements. \r\n\r\nResearch questions of interest include: \r\n\r\n- How can the data be used to predict weather conditions throughout the world? \r\n- How do the variables relate to each other? \r\n- Which variables have a greater effect on the climate variations? \r\n- Does the amount of movement of the buoy effect the reliability of the data? \r\n- When performing an analysis of the data, one should pay attention the possible affect of autocorrelation. Using a multiple regression approach to model the data would require a look at autoregression since the weather statistics of the previous days will affect today's weather. \r\n\r\nThe data is stored in an ASCII files with one observation per line. Spaces separate fields and periods (.) denote missing values. \r\n\r\nMore information and data from the TAO array can be found at the Pacific Marine Environmental Laboratory TAO data webpage: http://www.pmel.noaa.gov/toga-tao/\r\n\r\nInformation on storm data is available here: http://www.ncdc.noaa.gov/pdfs/sd/sd.html. 
This site contains data from January 1994 to April 1998 in a chronological listing by state provided by the National Weather Service. The data includes hurricanes, tornadoes, thunderstorms, hail, floods, drought conditions, lightning, high winds, snow, and temperature extremes. \r\n\r\nHurricane tracking data for the Atlantic is available here: http://wxp.eas.purdue.edu/hur_atlantic/. The site contains a map showing the paths of the Atlantic hurricanes and also includes the storms winds (in knots), pressure (in millibars), and the category of the storm based on Saffir-Simpson scale. \r\n\r\nAnother site of interest related to the ENSO cyles is available here: http://www.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/current_impacts/precip_accum.html. This site contains information on twelve areas of the world that have demonstrated ENSO-precipitation relationships. Included in the site are maps of the areas and time series plots of actual daily precipitation and accumulated normal precipitation for the areas. \r\n", 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': 'The data consists of the following variables: date, latitude, longitude, zonal winds (west<0, east>0), meridional winds (south<0, north>0), relative humidity, air temperature, sea surface temperature and subsurface temperatures down to a depth of 500 meters. Data taken from the buoys from as early as 1980 for some locations. Other data that was taken in various locations are rainfall, solar radiation, current levels, and subsurface temperatures. \r\n\r\nThe latitude and longitude in the data showed that the bouys moved around to different locations. The latitude values stayed within a degree from the approximate location. Yet the longitude values were sometimes as far as five degrees off of the approximate location. 
\r\n\r\nLooking at the wind data, both the zonal and meridional winds fluctuated between -10 m/s and 10 m/s. The plot of the two wind variables showed no linear relationship. Also, the plots of each wind variable against the other three meteorolgical data showed no linear relationships. \r\n\r\nThe relative humidity values in the tropical Pacific were typically between 70% and 90%. \r\n\r\nBoth the air temperature and the sea surface temperature fluctuated between 20 and 30 degrees Celcius. The plot of the two temperatures variables shows a positive linear relationship existing. The two temperatures when each plotted against time also have similar plot designs. Plots of the other meteorological variables against the temperature variables showed no linear relationship. \r\n\r\nThere are missing values in the data. As mentioned earlier, not all buoys are able to measure currents, rainfall, and solar radiation, so these values are missing dependent on the individual buoy. The amount of data available is also dependent on the buoy, as certain buoys were commissioned earlier than others. \r\n\r\nAll readings were taken at the same time of day. \r\n\r\n', 'citation': None}}
| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | obs | ID | Integer | None | None | None | no |
| 1 | year | Feature | Integer | None | None | None | no |
| 2 | month | Feature | Integer | None | None | None | no |
| 3 | day | Feature | Integer | None | None | None | no |
| 4 | date | Feature | Date | None | None | None | no |
| 5 | latitude | Feature | Continuous | None | None | None | no |
| 6 | longitude | Feature | Continuous | None | None | None | no |
| 7 | zon_winds | Feature | Continuous | None | None | None | yes |
| 8 | mer_winds | Feature | Continuous | None | None | None | yes |
| 9 | humidity | Feature | Continuous | None | None | None | yes |
| 10 | air_temp | Feature | Continuous | None | None | None | yes |
| 11 | ss_temp | Feature | Continuous | None | None | None | yes |
More information and data from the TAO array can be found at the Pacific Marine Environmental Laboratory TAO data webpage: http://www.pmel.noaa.gov/toga-tao/
Information on storm data is available here: http://www.ncdc.noaa.gov/pdfs/sd/sd.html. This site contains data from January 1994 to April 1998 in a chronological listing by state provided by the National Weather Service. The data includes hurricanes, tornadoes, thunderstorms, hail, floods, drought conditions, lightning, high winds, snow, and temperature extremes.
Hurricane tracking data for the Atlantic is available here: http://wxp.eas.purdue.edu/hur_atlantic/. The site contains a map showing the paths of the Atlantic hurricanes and also includes the storms' winds (in knots), pressure (in millibars), and the category of the storm based on the Saffir-Simpson scale.
print(el_nino.metadata.target_col)
None
Because this is a dataset of continuous measurements, no target column is designated. A target will need to be engineered from the raw buoy measurements to indicate the occurrence of a storm.
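One possible shape for such an engineered feature is sketched below. It is only an illustration: the sample values are made up, and the threshold `HIGH_WIND_THRESHOLD` and the `high_wind` label are hypothetical, not a validated definition of a storm.

```python
import numpy as np
import pandas as pd

# Small made-up sample using the dataset's wind column names.
sample = pd.DataFrame({
    "zon_winds": [-6.8, -1.2, 3.0, np.nan],
    "mer_winds": [0.7, 2.3, -5.0, 1.0],
})

# Hypothetical cut-off in m/s, chosen only for illustration.
HIGH_WIND_THRESHOLD = 5.0

# Wind speed is the magnitude of the (zonal, meridional) wind vector.
sample["wind_speed"] = np.hypot(sample["zon_winds"], sample["mer_winds"])

# Boolean label; rows with a missing wind component evaluate to False here.
sample["high_wind"] = sample["wind_speed"] > HIGH_WIND_THRESHOLD

print(sample)
```

A real target would likely need to combine several variables and be checked against external storm records, so this should be read as a starting point rather than a finished label.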
# Check the first 30 rows of the dataframe
remoteStations.head(30)
| | year | month | day | date | latitude | longitude | zon_winds | mer_winds | humidity | air_temp | ss_temp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | 3 | 7 | 800307 | -0.02 | -109.46 | -6.8 | 0.7 | NaN | 26.14 | 26.24 |
| 1 | 80 | 3 | 8 | 800308 | -0.02 | -109.46 | -4.9 | 1.1 | NaN | 25.66 | 25.97 |
| 2 | 80 | 3 | 9 | 800309 | -0.02 | -109.46 | -4.5 | 2.2 | NaN | 25.69 | 25.28 |
| 3 | 80 | 3 | 10 | 800310 | -0.02 | -109.46 | -3.8 | 1.9 | NaN | 25.57 | 24.31 |
| 4 | 80 | 3 | 11 | 800311 | -0.02 | -109.46 | -4.2 | 1.5 | NaN | 25.30 | 23.19 |
| 5 | 80 | 3 | 12 | 800312 | -0.02 | -109.46 | -4.4 | 0.3 | NaN | 24.72 | 23.64 |
| 6 | 80 | 3 | 13 | 800313 | -0.02 | -109.46 | -3.2 | 0.1 | NaN | 24.66 | 24.34 |
| 7 | 80 | 3 | 14 | 800314 | -0.02 | -109.46 | -3.1 | 0.6 | NaN | 25.17 | 24.14 |
| 8 | 80 | 3 | 15 | 800315 | -0.02 | -109.46 | -3.0 | 1.0 | NaN | 25.59 | 24.24 |
| 9 | 80 | 3 | 16 | 800316 | -0.02 | -109.46 | -1.2 | 1.0 | NaN | 26.71 | 25.94 |
| 10 | 80 | 3 | 17 | 800317 | -0.02 | -109.46 | -0.1 | 0.7 | NaN | 27.28 | 26.65 |
| 11 | 80 | 3 | 18 | 800318 | -0.02 | -109.46 | -1.2 | 2.3 | NaN | 26.86 | 27.13 |
| 12 | 80 | 3 | 19 | 800319 | -0.02 | -109.46 | -4.1 | -0.3 | NaN | 26.38 | 26.35 |
| 13 | 80 | 3 | 20 | 800320 | -0.02 | -109.46 | -4.8 | -0.8 | NaN | 26.19 | 25.87 |
| 14 | 80 | 3 | 21 | 800321 | -0.02 | -109.46 | -5.2 | 2.0 | NaN | 26.08 | 25.38 |
| 15 | 80 | 3 | 22 | 800322 | -0.02 | -109.46 | -2.7 | 2.7 | NaN | 26.24 | NaN |
| 16 | 80 | 3 | 23 | 800323 | -0.02 | -109.46 | -4.4 | 1.1 | NaN | 26.05 | NaN |
| 17 | 80 | 3 | 24 | 800324 | -0.02 | -109.46 | -4.3 | 0.7 | NaN | 25.67 | NaN |
| 18 | 80 | 3 | 25 | 800325 | -0.02 | -109.46 | -3.8 | 0.5 | NaN | 25.39 | NaN |
| 19 | 80 | 3 | 26 | 800326 | -0.02 | -109.46 | -3.0 | 0.2 | NaN | 25.17 | NaN |
| 20 | 80 | 3 | 27 | 800327 | -0.02 | -109.46 | -3.2 | -0.2 | NaN | 25.25 | NaN |
| 21 | 80 | 3 | 28 | 800328 | -0.02 | -109.46 | -1.9 | 0.7 | NaN | 25.35 | NaN |
| 22 | 80 | 3 | 29 | 800329 | -0.02 | -109.46 | -0.8 | 0.3 | NaN | 26.13 | NaN |
| 23 | 80 | 8 | 11 | 800811 | 0.00 | -109.56 | -3.3 | 1.5 | NaN | 21.48 | 21.81 |
| 24 | 80 | 8 | 12 | 800812 | 0.00 | -109.56 | -3.5 | 0.8 | NaN | 21.27 | 21.58 |
| 25 | 80 | 8 | 13 | 800813 | 0.00 | -109.56 | -4.9 | 1.9 | NaN | 21.11 | 21.32 |
| 26 | 80 | 8 | 14 | 800814 | 0.00 | -109.56 | -1.2 | 2.1 | NaN | 20.95 | 21.19 |
| 27 | 80 | 8 | 15 | 800815 | 0.00 | -109.56 | -1.2 | 2.7 | NaN | 21.76 | 21.47 |
| 28 | 80 | 8 | 16 | 800816 | 0.00 | -109.56 | -1.9 | 2.7 | NaN | 22.11 | 21.89 |
| 29 | 80 | 8 | 17 | 800817 | 0.00 | -109.56 | -4.2 | 0.7 | NaN | 21.69 | 21.85 |
# Check the last 30 rows of the dataframe
remoteStations.tail(30)
| | year | month | day | date | latitude | longitude | zon_winds | mer_winds | humidity | air_temp | ss_temp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 178050 | 98 | 5 | 17 | 980517 | 8.96 | -140.33 | -5.3 | -4.7 | 82.2 | 27.01 | 27.57 |
| 178051 | 98 | 5 | 18 | 980518 | 8.96 | -140.32 | -5.4 | -4.3 | 82.2 | 27.09 | 27.59 |
| 178052 | 98 | 5 | 19 | 980519 | 8.95 | -140.33 | -6.4 | -5.3 | 84.7 | 27.28 | 27.54 |
| 178053 | 98 | 5 | 20 | 980520 | 8.96 | -140.34 | -8.3 | -6.0 | 83.0 | 27.28 | 27.48 |
| 178054 | 98 | 5 | 21 | 980521 | 8.96 | -140.33 | -7.7 | -6.3 | 89.8 | 27.24 | 27.52 |
| 178055 | 98 | 5 | 22 | 980522 | 8.96 | -140.32 | -7.3 | -6.4 | 86.4 | 27.40 | 27.53 |
| 178056 | 98 | 5 | 23 | 980523 | 8.96 | -140.32 | -6.3 | -6.4 | 83.5 | 27.32 | 27.57 |
| 178057 | 98 | 5 | 24 | 980524 | 8.95 | -140.32 | -5.7 | -3.6 | 86.4 | 26.70 | 27.62 |
| 178058 | 98 | 5 | 25 | 980525 | 8.96 | -140.32 | -6.2 | -5.8 | 83.0 | 27.36 | 27.68 |
| 178059 | 98 | 5 | 26 | 980526 | 8.96 | -140.34 | -6.4 | -5.3 | 82.2 | 27.32 | 27.70 |
| 178060 | 98 | 5 | 27 | 980527 | 8.96 | -140.33 | -4.9 | -6.2 | 87.3 | 27.09 | 27.85 |
| 178061 | 98 | 5 | 28 | 980528 | 8.96 | -140.33 | -6.3 | -4.9 | 91.5 | 26.82 | 27.98 |
| 178062 | 98 | 5 | 29 | 980529 | 8.97 | -140.32 | -6.7 | -3.7 | 94.1 | 26.62 | 28.04 |
| 178063 | 98 | 5 | 30 | 980530 | 8.96 | -140.33 | -6.3 | -4.8 | 92.0 | 26.89 | 27.98 |
| 178064 | 98 | 5 | 31 | 980531 | 8.97 | -140.33 | -6.3 | -4.9 | 86.9 | 27.44 | 28.13 |
| 178065 | 98 | 6 | 1 | 980601 | 8.97 | -140.32 | -4.2 | -2.5 | 87.3 | 26.62 | 28.14 |
| 178066 | 98 | 6 | 2 | 980602 | 8.96 | -140.32 | -6.8 | -2.4 | 86.0 | 27.60 | 28.09 |
| 178067 | 98 | 6 | 3 | 980603 | 8.96 | -140.33 | -7.1 | -3.2 | 82.2 | 27.87 | 28.15 |
| 178068 | 98 | 6 | 4 | 980604 | 8.96 | -140.33 | -6.7 | -4.7 | 81.3 | 27.75 | 28.19 |
| 178069 | 98 | 6 | 5 | 980605 | 8.96 | -140.32 | -6.4 | -5.7 | 82.6 | 27.75 | 28.24 |
| 178070 | 98 | 6 | 6 | 980606 | 8.96 | -140.33 | -6.6 | -4.3 | 81.3 | 27.71 | 28.28 |
| 178071 | 98 | 6 | 7 | 980607 | 8.95 | -140.33 | -8.4 | -4.2 | 83.5 | 27.91 | 28.26 |
| 178072 | 98 | 6 | 8 | 980608 | 8.96 | -140.33 | -8.4 | -5.0 | 79.2 | 27.87 | 28.22 |
| 178073 | 98 | 6 | 9 | 980609 | 8.98 | -140.33 | -6.5 | -5.9 | 75.4 | 27.56 | 28.22 |
| 178074 | 98 | 6 | 10 | 980610 | 8.95 | -140.33 | -6.8 | -5.3 | 81.3 | 27.52 | 28.17 |
| 178075 | 98 | 6 | 11 | 980611 | 8.96 | -140.33 | -5.1 | -0.4 | 94.1 | 26.04 | 28.14 |
| 178076 | 98 | 6 | 12 | 980612 | 8.96 | -140.32 | -4.3 | -3.3 | 93.2 | 25.80 | 27.87 |
| 178077 | 98 | 6 | 13 | 980613 | 8.95 | -140.34 | -6.1 | -4.8 | 81.3 | 27.17 | 27.93 |
| 178078 | 98 | 6 | 14 | 980614 | 8.96 | -140.33 | -4.9 | -2.3 | 76.2 | 27.36 | 28.03 |
| 178079 | 98 | 6 | 15 | 980615 | 8.95 | -140.33 | NaN | NaN | NaN | 27.09 | 28.09 |
# Check the data info
remoteStations.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178080 entries, 0 to 178079
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   year       178080 non-null  int64
 1   month      178080 non-null  int64
 2   day        178080 non-null  int64
 3   date       178080 non-null  int64
 4   latitude   178080 non-null  float64
 5   longitude  178080 non-null  float64
 6   zon_winds  152917 non-null  float64
 7   mer_winds  152918 non-null  float64
 8   humidity   112319 non-null  float64
 9   air_temp   159843 non-null  float64
 10  ss_temp    161073 non-null  float64
dtypes: float64(7), int64(4)
memory usage: 14.9 MB
# Check the shape of the dataframe
remoteStations.shape
(178080, 11)
# Determine the number of missing values
# Syntax: DataFrame.isnull().sum()
remoteStations.isnull().sum()
year             0
month            0
day              0
date             0
latitude         0
longitude        0
zon_winds    25163
mer_winds    25162
humidity     65761
air_temp     18237
ss_temp      17007
dtype: int64
# Let's create a function to determine the percentage of missing values
# Typically less than five percent missing values may not affect the results
# More than 5% can be dropped, replaced with existing data, or imputed using mean or median.
def missing(DataFrame):
    print('Percentage of missing values in the dataset:\n',
          round((DataFrame.isnull().sum() * 100 / len(DataFrame)), 2).sort_values(ascending=False))
# Call the function and execute
# Syntax: missing(DataFrame)
missing(remoteStations)
Percentage of missing values in the dataset:
humidity     36.93
zon_winds    14.13
mer_winds    14.13
air_temp     10.24
ss_temp       9.55
year          0.00
month         0.00
day           0.00
date          0.00
latitude      0.00
longitude     0.00
dtype: float64
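The three options named in the comments above (drop, replace with existing data, impute with the median) can be sketched on a small made-up sample. The values are illustrative, and which option suits each column of the real data is still an open decision for the cleaning plan.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps, using two of the dataset's column names.
sample = pd.DataFrame({
    "humidity": [np.nan, 80.0, 90.0, np.nan, 85.0],
    "air_temp": [26.0, np.nan, 27.0, 27.5, 28.0],
})

# Option 1: drop every row that has any missing value.
dropped = sample.dropna()

# Option 2: fill each column's gaps with that column's median.
median_filled = sample.fillna(sample.median())

# Option 3: reuse existing data by carrying the last observed value forward.
# A gap in the first row has nothing before it, so it stays missing.
forward_filled = sample.ffill()

print(len(dropped), "rows remain after dropping")
print(median_filled)
print(forward_filled)
```

With roughly 37% of humidity values missing, dropping rows would discard a large share of the data, so the trade-off between the options is worth checking column by column.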
# Let's see if all the null values are for a particular date or station
null_data = remoteStations[remoteStations.isnull().any(axis=1)]
# View the results
null_data.head(100)
| | year | month | day | date | latitude | longitude | zon_winds | mer_winds | humidity | air_temp | ss_temp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | 3 | 7 | 800307 | -0.02 | -109.46 | -6.8 | 0.7 | NaN | 26.14 | 26.24 |
| 1 | 80 | 3 | 8 | 800308 | -0.02 | -109.46 | -4.9 | 1.1 | NaN | 25.66 | 25.97 |
| 2 | 80 | 3 | 9 | 800309 | -0.02 | -109.46 | -4.5 | 2.2 | NaN | 25.69 | 25.28 |
| 3 | 80 | 3 | 10 | 800310 | -0.02 | -109.46 | -3.8 | 1.9 | NaN | 25.57 | 24.31 |
| 4 | 80 | 3 | 11 | 800311 | -0.02 | -109.46 | -4.2 | 1.5 | NaN | 25.30 | 23.19 |
| 5 | 80 | 3 | 12 | 800312 | -0.02 | -109.46 | -4.4 | 0.3 | NaN | 24.72 | 23.64 |
| 6 | 80 | 3 | 13 | 800313 | -0.02 | -109.46 | -3.2 | 0.1 | NaN | 24.66 | 24.34 |
| 7 | 80 | 3 | 14 | 800314 | -0.02 | -109.46 | -3.1 | 0.6 | NaN | 25.17 | 24.14 |
| 8 | 80 | 3 | 15 | 800315 | -0.02 | -109.46 | -3.0 | 1.0 | NaN | 25.59 | 24.24 |
| 9 | 80 | 3 | 16 | 800316 | -0.02 | -109.46 | -1.2 | 1.0 | NaN | 26.71 | 25.94 |
| 10 | 80 | 3 | 17 | 800317 | -0.02 | -109.46 | -0.1 | 0.7 | NaN | 27.28 | 26.65 |
| 11 | 80 | 3 | 18 | 800318 | -0.02 | -109.46 | -1.2 | 2.3 | NaN | 26.86 | 27.13 |
| 12 | 80 | 3 | 19 | 800319 | -0.02 | -109.46 | -4.1 | -0.3 | NaN | 26.38 | 26.35 |
| 13 | 80 | 3 | 20 | 800320 | -0.02 | -109.46 | -4.8 | -0.8 | NaN | 26.19 | 25.87 |
| 14 | 80 | 3 | 21 | 800321 | -0.02 | -109.46 | -5.2 | 2.0 | NaN | 26.08 | 25.38 |
| 15 | 80 | 3 | 22 | 800322 | -0.02 | -109.46 | -2.7 | 2.7 | NaN | 26.24 | NaN |
| 16 | 80 | 3 | 23 | 800323 | -0.02 | -109.46 | -4.4 | 1.1 | NaN | 26.05 | NaN |
| 17 | 80 | 3 | 24 | 800324 | -0.02 | -109.46 | -4.3 | 0.7 | NaN | 25.67 | NaN |
| 18 | 80 | 3 | 25 | 800325 | -0.02 | -109.46 | -3.8 | 0.5 | NaN | 25.39 | NaN |
| 19 | 80 | 3 | 26 | 800326 | -0.02 | -109.46 | -3.0 | 0.2 | NaN | 25.17 | NaN |
| 20 | 80 | 3 | 27 | 800327 | -0.02 | -109.46 | -3.2 | -0.2 | NaN | 25.25 | NaN |
| 21 | 80 | 3 | 28 | 800328 | -0.02 | -109.46 | -1.9 | 0.7 | NaN | 25.35 | NaN |
| 22 | 80 | 3 | 29 | 800329 | -0.02 | -109.46 | -0.8 | 0.3 | NaN | 26.13 | NaN |
| 23 | 80 | 8 | 11 | 800811 | 0.00 | -109.56 | -3.3 | 1.5 | NaN | 21.48 | 21.81 |
| 24 | 80 | 8 | 12 | 800812 | 0.00 | -109.56 | -3.5 | 0.8 | NaN | 21.27 | 21.58 |
| 25 | 80 | 8 | 13 | 800813 | 0.00 | -109.56 | -4.9 | 1.9 | NaN | 21.11 | 21.32 |
| 26 | 80 | 8 | 14 | 800814 | 0.00 | -109.56 | -1.2 | 2.1 | NaN | 20.95 | 21.19 |
| 27 | 80 | 8 | 15 | 800815 | 0.00 | -109.56 | -1.2 | 2.7 | NaN | 21.76 | 21.47 |
| 28 | 80 | 8 | 16 | 800816 | 0.00 | -109.56 | -1.9 | 2.7 | NaN | 22.11 | 21.89 |
| 29 | 80 | 8 | 17 | 800817 | 0.00 | -109.56 | -4.2 | 0.7 | NaN | 21.69 | 21.85 |
| 30 | 80 | 8 | 18 | 800818 | 0.00 | -109.56 | -4.3 | 0.6 | NaN | 21.58 | 21.74 |
| 31 | 80 | 8 | 19 | 800819 | 0.00 | -109.56 | -5.1 | 1.4 | NaN | 21.45 | 21.70 |
| 32 | 80 | 8 | 20 | 800820 | 0.00 | -109.56 | -3.5 | 2.1 | NaN | 21.48 | 21.75 |
| 33 | 80 | 8 | 21 | 800821 | 0.00 | -109.56 | -4.9 | 2.0 | NaN | 21.61 | 21.70 |
| 34 | 80 | 8 | 22 | 800822 | 0.00 | -109.56 | -4.3 | 2.5 | NaN | 21.88 | 22.07 |
| 35 | 80 | 8 | 23 | 800823 | 0.00 | -109.56 | -7.2 | 3.7 | NaN | 22.07 | 22.32 |
| 36 | 80 | 8 | 24 | 800824 | 0.00 | -109.56 | -6.8 | 2.8 | NaN | 21.69 | 21.91 |
| 37 | 80 | 8 | 25 | 800825 | 0.00 | -109.56 | -5.1 | 3.3 | NaN | 21.83 | 21.96 |
| 38 | 80 | 8 | 26 | 800826 | 0.00 | -109.56 | -4.5 | 3.8 | NaN | 22.04 | 22.25 |
| 39 | 80 | 8 | 27 | 800827 | 0.00 | -109.56 | -3.8 | 3.7 | NaN | 22.36 | 22.61 |
| 40 | 80 | 8 | 28 | 800828 | 0.00 | -109.56 | -3.9 | 3.3 | NaN | 22.51 | 22.69 |
| 41 | 80 | 8 | 29 | 800829 | 0.00 | -109.56 | -5.0 | 2.4 | NaN | 22.45 | 23.08 |
| 42 | 80 | 8 | 30 | 800830 | 0.00 | -109.56 | -5.9 | 3.7 | NaN | 23.00 | 24.55 |
| 43 | 80 | 8 | 31 | 800831 | 0.00 | -109.56 | -6.2 | 5.1 | NaN | 23.12 | 24.79 |
| 44 | 80 | 9 | 1 | 800901 | 0.00 | -109.56 | -6.2 | 3.8 | NaN | 22.67 | 24.04 |
| 45 | 80 | 9 | 2 | 800902 | 0.00 | -109.56 | -5.4 | 2.7 | NaN | 22.08 | 23.00 |
| 46 | 80 | 9 | 3 | 800903 | 0.00 | -109.56 | -4.2 | 3.0 | NaN | 21.76 | 22.16 |
| 47 | 80 | 9 | 4 | 800904 | 0.00 | -109.56 | -3.7 | 1.7 | NaN | 21.90 | 21.85 |
| 48 | 80 | 9 | 5 | 800905 | 0.00 | -109.56 | -4.5 | 3.0 | NaN | 21.72 | 21.61 |
| 49 | 80 | 9 | 6 | 800906 | 0.00 | -109.56 | -4.1 | 2.4 | NaN | 21.84 | 21.52 |
| 50 | 80 | 9 | 7 | 800907 | 0.00 | -109.56 | -4.5 | 3.0 | NaN | 21.76 | 21.69 |
| 51 | 80 | 9 | 8 | 800908 | 0.00 | -109.56 | -4.1 | 3.5 | NaN | 21.95 | 21.80 |
| 52 | 80 | 9 | 9 | 800909 | 0.00 | -109.56 | -4.7 | 3.8 | NaN | 22.03 | 21.88 |
| 53 | 80 | 9 | 10 | 800910 | 0.00 | -109.56 | -4.4 | 2.6 | NaN | 22.03 | 22.09 |
| 54 | 80 | 9 | 11 | 800911 | 0.00 | -109.56 | -2.7 | 3.6 | NaN | 21.97 | 22.11 |
| 55 | 80 | 9 | 12 | 800912 | 0.00 | -109.56 | -3.6 | 4.2 | NaN | 21.97 | 22.00 |
| 56 | 80 | 9 | 13 | 800913 | 0.00 | -109.56 | -4.4 | 3.7 | NaN | 21.75 | 21.88 |
| 57 | 80 | 9 | 14 | 800914 | 0.00 | -109.56 | -3.3 | 2.5 | NaN | 21.89 | 21.93 |
| 58 | 80 | 9 | 15 | 800915 | 0.00 | -109.56 | -2.8 | 3.2 | NaN | 22.00 | 22.24 |
| 59 | 80 | 9 | 16 | 800916 | 0.00 | -109.56 | -1.4 | 3.8 | NaN | 22.33 | 22.59 |
| 60 | 80 | 9 | 17 | 800917 | 0.00 | -109.56 | -2.8 | 5.3 | NaN | 22.72 | 23.32 |
| 61 | 80 | 9 | 18 | 800918 | 0.00 | -109.56 | -4.7 | 5.2 | NaN | 22.93 | 23.13 |
| 62 | 80 | 9 | 19 | 800919 | 0.00 | -109.56 | -4.0 | 4.4 | NaN | 22.47 | 23.36 |
| 63 | 80 | 9 | 20 | 800920 | 0.00 | -109.56 | -6.0 | 4.2 | NaN | 22.92 | 23.69 |
| 64 | 80 | 9 | 21 | 800921 | 0.00 | -109.56 | -5.6 | 3.4 | NaN | 22.50 | 23.60 |
| 65 | 80 | 9 | 22 | 800922 | 0.00 | -109.56 | -5.3 | 4.6 | NaN | 22.70 | 24.11 |
| 66 | 80 | 9 | 23 | 800923 | 0.00 | -109.56 | -5.1 | 3.9 | NaN | 22.68 | 24.28 |
| 67 | 80 | 9 | 24 | 800924 | 0.00 | -109.56 | -4.5 | 3.3 | NaN | 22.58 | 23.77 |
| 68 | 80 | 9 | 25 | 800925 | 0.00 | -109.56 | -3.3 | 2.7 | NaN | 22.32 | 22.87 |
| 69 | 80 | 9 | 26 | 800926 | 0.00 | -109.56 | -2.6 | 3.7 | NaN | 21.81 | 22.05 |
| 70 | 80 | 9 | 27 | 800927 | 0.00 | -109.56 | -4.3 | 1.3 | NaN | 21.91 | 21.81 |
| 71 | 80 | 9 | 28 | 800928 | 0.00 | -109.56 | -4.0 | 1.3 | NaN | 21.81 | 21.71 |
| 72 | 80 | 9 | 29 | 800929 | 0.00 | -109.56 | -4.0 | 1.8 | NaN | 21.96 | 22.07 |
| 73 | 80 | 9 | 30 | 800930 | 0.00 | -109.56 | -4.9 | 2.3 | NaN | 22.05 | 22.38 |
| 74 | 80 | 10 | 1 | 801001 | 0.00 | -109.56 | -4.4 | 3.4 | NaN | 22.45 | 22.44 |
| 75 | 80 | 10 | 2 | 801002 | 0.00 | -109.56 | -4.8 | 4.2 | NaN | 22.32 | 22.16 |
| 76 | 80 | 10 | 3 | 801003 | 0.00 | -109.56 | -5.8 | 4.0 | NaN | 21.94 | 21.88 |
| 77 | 80 | 10 | 4 | 801004 | 0.00 | -109.56 | -4.1 | 2.5 | NaN | 21.53 | 21.66 |
| 78 | 80 | 10 | 5 | 801005 | 0.00 | -109.56 | -3.2 | 2.4 | NaN | 21.28 | 21.42 |
| 79 | 80 | 10 | 6 | 801006 | 0.00 | -109.56 | -3.2 | 3.2 | NaN | 21.12 | 21.17 |
| 80 | 80 | 10 | 7 | 801007 | 0.00 | -109.56 | -2.2 | 2.8 | NaN | 21.08 | 21.20 |
| 81 | 80 | 10 | 8 | 801008 | 0.00 | -109.56 | -2.7 | 2.7 | NaN | 21.02 | 20.97 |
| 82 | 80 | 10 | 9 | 801009 | 0.00 | -109.56 | -3.7 | 2.2 | NaN | 20.65 | 20.71 |
| 83 | 80 | 10 | 10 | 801010 | 0.00 | -109.56 | -3.5 | 1.1 | NaN | 20.71 | 20.74 |
| 84 | 80 | 10 | 11 | 801011 | 0.00 | -109.56 | -4.2 | 1.2 | NaN | 20.54 | 20.72 |
| 85 | 80 | 10 | 12 | 801012 | 0.00 | -109.56 | -2.8 | -0.3 | NaN | 20.45 | 20.51 |
| 86 | 80 | 10 | 13 | 801013 | 0.00 | -109.56 | -3.7 | 0.1 | NaN | 20.52 | 20.71 |
| 87 | 80 | 10 | 14 | 801014 | 0.00 | -109.56 | -2.9 | 0.4 | NaN | 20.42 | 20.52 |
| 88 | 80 | 10 | 15 | 801015 | 0.00 | -109.56 | -1.5 | -0.4 | NaN | 20.73 | 20.62 |
| 89 | 80 | 10 | 16 | 801016 | 0.00 | -109.56 | -2.3 | 1.6 | NaN | 20.62 | 20.47 |
| 90 | 80 | 10 | 17 | 801017 | 0.00 | -109.56 | -2.1 | 0.1 | NaN | 20.84 | 20.59 |
| 91 | 80 | 10 | 18 | 801018 | 0.00 | -109.56 | -2.0 | 1.3 | NaN | 20.92 | 20.73 |
| 92 | 80 | 10 | 19 | 801019 | 0.00 | -109.56 | -2.6 | 0.9 | NaN | 21.23 | 21.07 |
| 93 | 80 | 10 | 20 | 801020 | 0.00 | -109.56 | -3.8 | 0.1 | NaN | 21.19 | 21.08 |
| 94 | 80 | 10 | 21 | 801021 | 0.00 | -109.56 | -5.3 | 1.9 | NaN | 21.20 | 21.18 |
| 95 | 80 | 10 | 22 | 801022 | 0.00 | -109.56 | -4.9 | 3.0 | NaN | 21.45 | 21.36 |
| 96 | 80 | 10 | 23 | 801023 | 0.00 | -109.56 | -5.3 | 2.3 | NaN | 21.50 | 21.46 |
| 97 | 80 | 10 | 24 | 801024 | 0.00 | -109.56 | -5.1 | 3.6 | NaN | 21.80 | 21.60 |
| 98 | 80 | 10 | 25 | 801025 | 0.00 | -109.56 | -3.4 | 2.7 | NaN | 21.87 | 21.78 |
| 99 | 80 | 10 | 26 | 801026 | 0.00 | -109.56 | -6.3 | 2.2 | NaN | 21.60 | 21.84 |
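Scanning rows by eye only goes so far. One way to check whether missing values cluster in particular years is to compute the missing percentage per column within each year. The sketch below does this on a small made-up sample; applying the same pattern to `remoteStations` is an assumption about how the check could be done, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Small made-up sample with the same column names as the dataset.
sample = pd.DataFrame({
    "year": [80, 80, 80, 96, 96, 98],
    "humidity": [np.nan, np.nan, np.nan, 80.3, 80.7, 82.2],
    "zon_winds": [-6.8, -4.9, -4.5, np.nan, np.nan, -5.3],
})

# Percentage of missing values per column within each year.
missing_by_year = (
    sample.drop(columns="year")
    .isnull()                  # True where a value is missing
    .groupby(sample["year"])   # group the flags by the year of each row
    .mean()                    # share of missing values per group
    .mul(100)
    .round(2)
)

print(missing_by_year)
```

Grouping by rounded latitude and longitude instead of year would give the equivalent view per station.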
# The top of the data seems to be missing humidity readings in 1980
# Check the bottom for the same.
null_data.tail(100)
| | year | month | day | date | latitude | longitude | zon_winds | mer_winds | humidity | air_temp | ss_temp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 177258 | 96 | 3 | 16 | 960316 | 9.00 | -140.25 | NaN | NaN | 80.3 | 25.76 | 26.50 |
| 177259 | 96 | 3 | 17 | 960317 | 9.00 | -140.25 | NaN | NaN | 80.7 | 26.00 | 26.46 |
| 177260 | 96 | 3 | 18 | 960318 | 9.00 | -140.26 | NaN | NaN | 81.9 | 26.15 | 26.51 |
| 177261 | 96 | 3 | 19 | 960319 | 9.01 | -140.26 | NaN | NaN | 85.5 | 25.96 | 26.47 |
| 177262 | 96 | 3 | 20 | 960320 | 9.00 | -140.25 | NaN | NaN | 86.7 | 26.19 | 26.41 |
| 177263 | 96 | 3 | 21 | 960321 | 8.99 | -140.25 | NaN | NaN | 81.5 | 26.15 | 26.41 |
| 177264 | 96 | 3 | 22 | 960322 | 9.00 | -140.26 | NaN | NaN | 81.1 | 26.19 | 26.42 |
| 177265 | 96 | 3 | 23 | 960323 | 9.00 | -140.26 | NaN | NaN | 84.3 | 26.08 | 26.45 |
| 177266 | 96 | 3 | 24 | 960324 | 9.00 | -140.26 | NaN | NaN | 85.5 | 26.00 | 26.41 |
| 177267 | 96 | 3 | 25 | 960325 | 9.00 | -140.26 | NaN | NaN | 90.3 | 25.64 | 26.46 |
| 177268 | 96 | 3 | 26 | 960326 | 9.00 | -140.26 | NaN | NaN | 90.3 | 26.19 | 26.48 |
| 177269 | 96 | 3 | 27 | 960327 | 9.00 | -140.27 | NaN | NaN | 83.5 | 26.31 | 26.50 |
| 177270 | 96 | 3 | 28 | 960328 | 9.00 | -140.26 | NaN | NaN | 81.5 | 26.23 | 26.54 |
| 177271 | 96 | 3 | 29 | 960329 | 9.00 | -140.26 | NaN | NaN | 82.7 | 26.23 | 26.52 |
| 177272 | 96 | 3 | 30 | 960330 | 9.01 | -140.26 | NaN | NaN | 83.9 | 26.31 | 26.54 |
| 177273 | 96 | 3 | 31 | 960331 | 9.00 | -140.26 | NaN | NaN | 90.7 | 26.08 | 26.56 |
| 177274 | 96 | 4 | 1 | 960401 | 9.00 | -140.26 | NaN | NaN | 90.3 | 26.35 | 26.64 |
| 177275 | 96 | 4 | 2 | 960402 | 9.01 | -140.26 | NaN | NaN | 82.7 | 26.23 | 26.69 |
| 177276 | 96 | 4 | 3 | 960403 | 9.00 | -140.26 | NaN | NaN | 75.2 | 26.43 | 26.69 |
| 177277 | 96 | 4 | 4 | 960404 | 9.01 | -140.26 | NaN | NaN | 81.1 | 26.31 | 26.80 |
| 177278 | 96 | 4 | 5 | 960405 | 9.00 | -140.26 | NaN | NaN | 87.1 | 26.15 | 26.81 |
| 177279 | 96 | 4 | 6 | 960406 | 9.00 | -140.27 | NaN | NaN | 90.7 | 25.76 | 26.83 |
| 177280 | 96 | 4 | 7 | 960407 | 9.01 | -140.26 | NaN | NaN | 83.9 | 26.12 | 26.82 |
| 177281 | 96 | 4 | 8 | 960408 | 9.00 | -140.26 | NaN | NaN | 82.7 | 25.92 | 26.80 |
| 177282 | 96 | 4 | 9 | 960409 | 9.00 | -140.26 | NaN | NaN | 83.1 | 26.08 | 26.82 |
| 177283 | 96 | 4 | 10 | 960410 | 9.00 | -140.26 | NaN | NaN | 84.7 | 26.43 | 26.73 |
| 177284 | 96 | 4 | 11 | 960411 | 8.99 | -140.26 | NaN | NaN | 86.3 | 26.47 | 26.73 |
| 177285 | 96 | 4 | 12 | 960412 | 9.00 | -140.26 | NaN | NaN | 80.7 | 26.39 | 26.83 |
| 177286 | 96 | 4 | 13 | 960413 | 9.00 | -140.26 | NaN | NaN | NaN | 26.27 | 26.85 |
| 177287 | 96 | 4 | 14 | 960414 | 9.00 | -140.26 | NaN | NaN | 93.5 | 24.94 | 26.81 |
| 177288 | 96 | 4 | 15 | 960415 | 9.00 | -140.26 | NaN | NaN | 90.7 | 25.06 | 26.59 |
| 177289 | 96 | 4 | 16 | 960416 | 9.00 | -140.27 | NaN | NaN | 85.9 | 25.88 | 26.77 |
| 177290 | 96 | 4 | 17 | 960417 | 9.00 | -140.25 | NaN | NaN | 77.9 | 26.15 | 26.84 |
| 177291 | 96 | 4 | 18 | 960418 | 9.00 | -140.27 | NaN | NaN | 84.3 | 26.23 | 26.84 |
| 177292 | 96 | 4 | 19 | 960419 | 8.99 | -140.27 | NaN | NaN | 85.5 | 26.39 | 26.83 |
| 177293 | 96 | 4 | 20 | 960420 | 9.00 | -140.26 | NaN | NaN | 83.1 | 26.55 | 26.84 |
| 177294 | 96 | 4 | 21 | 960421 | 8.99 | -140.26 | NaN | NaN | 82.7 | 26.35 | 26.91 |
| 177295 | 96 | 4 | 22 | 960422 | 9.00 | -140.26 | NaN | NaN | 90.7 | 26.04 | 26.92 |
| 177296 | 96 | 4 | 23 | 960423 | 9.00 | -140.27 | NaN | NaN | 89.5 | 26.39 | 26.89 |
| 177297 | 96 | 4 | 24 | 960424 | 9.00 | -140.26 | NaN | NaN | 85.5 | 26.59 | 26.89 |
| 177298 | 96 | 4 | 25 | 960425 | 9.01 | -140.27 | NaN | NaN | 80.7 | 26.35 | 26.86 |
| 177299 | 96 | 4 | 26 | 960426 | 9.01 | -140.26 | NaN | NaN | 90.7 | 25.84 | 26.90 |
| 177300 | 96 | 4 | 27 | 960427 | 9.00 | -140.27 | NaN | NaN | 89.9 | 26.00 | 26.88 |
| 177301 | 96 | 4 | 28 | 960428 | 9.00 | -140.26 | NaN | NaN | 85.5 | 26.27 | 26.88 |
| 177302 | 96 | 4 | 29 | 960429 | 9.00 | -140.26 | NaN | NaN | 85.9 | 26.35 | 26.93 |
| 177303 | 96 | 4 | 30 | 960430 | 9.00 | -140.26 | NaN | NaN | 87.1 | 26.62 | 26.97 |
| 177304 | 96 | 5 | 1 | 960501 | 9.00 | -140.26 | NaN | NaN | 90.7 | 26.59 | 27.07 |
| 177305 | 96 | 5 | 2 | 960502 | 9.00 | -140.26 | NaN | NaN | 91.1 | 26.08 | 27.10 |
| 177306 | 96 | 5 | 3 | 960503 | 9.00 | -140.26 | NaN | NaN | 91.9 | 26.00 | 27.12 |
| 177307 | 96 | 5 | 4 | 960504 | 9.00 | -140.26 | NaN | NaN | 88.7 | 26.35 | 27.11 |
| 177308 | 96 | 5 | 5 | 960505 | 9.00 | -140.26 | NaN | NaN | 91.5 | 26.19 | 27.04 |
| 177309 | 96 | 5 | 6 | 960506 | 9.00 | -140.27 | NaN | NaN | 91.9 | 26.39 | 27.08 |
| 177310 | 96 | 5 | 7 | 960507 | 9.00 | -140.26 | NaN | NaN | 89.9 | 26.78 | 27.06 |
| 177311 | 96 | 5 | 8 | 960508 | 9.00 | -140.27 | NaN | NaN | 85.9 | 26.82 | 27.05 |
| 177312 | 96 | 5 | 9 | 960509 | 9.00 | -140.26 | NaN | NaN | 81.5 | 26.94 | 27.09 |
| 177313 | 96 | 5 | 10 | 960510 | 9.00 | -140.26 | NaN | NaN | 83.1 | 26.78 | 27.14 |
| 177314 | 96 | 5 | 11 | 960511 | 9.00 | -140.26 | NaN | NaN | 83.5 | 26.66 | 27.08 |
| 177315 | 96 | 5 | 12 | 960512 | 9.00 | -140.26 | NaN | NaN | 88.7 | 26.59 | 27.09 |
| 177316 | 96 | 5 | 13 | 960513 | 9.00 | -140.26 | NaN | NaN | 90.7 | 26.39 | 27.11 |
| 177317 | 96 | 5 | 14 | 960514 | 8.99 | -140.27 | NaN | NaN | 85.9 | 26.59 | 27.04 |
| 177318 | 96 | 5 | 15 | 960515 | 9.00 | -140.26 | NaN | NaN | 91.9 | 25.53 | 26.92 |
| 177319 | 96 | 5 | 16 | 960516 | 9.00 | -140.26 | NaN | NaN | 91.9 | 25.37 | 26.93 |
| 177320 | 96 | 5 | 17 | 960517 | 8.99 | -140.26 | NaN | NaN | 85.9 | 26.47 | 26.96 |
| 177321 | 96 | 5 | 18 | 960518 | 9.00 | -140.26 | NaN | NaN | 85.1 | 26.19 | 26.93 |
| 177322 | 96 | 5 | 19 | 960519 | 9.00 | -140.26 | NaN | NaN | 85.9 | 26.78 | 26.98 |
| 177323 | 96 | 5 | 20 | 960520 | 9.00 | -140.27 | NaN | NaN | 83.5 | 26.66 | 26.96 |
| 177324 | 96 | 5 | 21 | 960521 | 9.00 | -140.26 | NaN | NaN | 81.1 | 26.35 | 26.98 |
| 177325 | 96 | 5 | 22 | 960522 | 9.00 | -140.26 | NaN | NaN | 82.7 | 26.66 | 27.06 |
| 177326 | 96 | 5 | 23 | 960523 | 9.00 | -140.26 | NaN | NaN | 82.3 | 26.66 | 27.08 |
| 177327 | 96 | 5 | 24 | 960524 | 8.98 | -140.28 | NaN | NaN | 81.5 | 26.55 | 27.17 |
| 177328 | 96 | 5 | 25 | 960525 | 8.99 | -140.26 | NaN | NaN | 81.5 | 26.59 | 27.15 |
| 177329 | 96 | 5 | 26 | 960526 | 9.00 | -140.26 | NaN | NaN | 86.3 | 26.55 | 27.19 |
| 177330 | 96 | 5 | 27 | 960527 | 9.00 | -140.26 | NaN | NaN | 83.5 | 26.70 | 27.24 |
| 177331 | 96 | 5 | 28 | 960528 | 9.00 | -140.26 | NaN | NaN | 83.9 | 26.78 | 27.28 |
| 177332 | 96 | 5 | 29 | 960529 | 9.00 | -140.26 | NaN | NaN | 81.5 | 26.55 | 27.23 |
| 177333 | 96 | 5 | 30 | 960530 | 9.00 | -140.26 | NaN | NaN | 83.9 | 26.31 | 27.21 |
| 177334 | 96 | 5 | 31 | 960531 | 9.00 | -140.26 | NaN | NaN | 88.3 | 26.08 | 27.17 |
| 177335 | 96 | 6 | 1 | 960601 | 9.00 | -140.26 | NaN | NaN | 83.5 | 26.66 | 27.40 |
| 177336 | 96 | 6 | 2 | 960602 | 9.00 | -140.26 | NaN | NaN | 83.1 | 26.55 | 27.37 |
| 177337 | 96 | 6 | 3 | 960603 | 9.00 | -140.27 | NaN | NaN | 89.9 | 26.27 | 27.15 |
| 177338 | 96 | 6 | 4 | 960604 | 9.00 | -140.27 | NaN | NaN | 91.5 | 25.64 | 27.13 |
| 177339 | 96 | 6 | 5 | 960605 | 9.00 | -140.26 | NaN | NaN | 93.1 | 24.98 | 27.11 |
| 177340 | 96 | 6 | 6 | 960606 | 9.00 | -140.26 | NaN | NaN | 85.5 | 26.39 | 27.13 |
| 177341 | 96 | 6 | 7 | 960607 | 9.00 | -140.28 | NaN | NaN | 81.1 | 26.59 | 27.25 |
| 177342 | 96 | 6 | 8 | 960608 | 9.00 | -140.26 | NaN | NaN | 90.7 | 25.49 | 27.30 |
| 177343 | 96 | 6 | 9 | 960609 | 9.00 | -140.26 | NaN | NaN | 87.1 | 26.39 | 27.29 |
| 177344 | 96 | 6 | 10 | 960610 | 9.00 | -140.26 | NaN | NaN | NaN | NaN | NaN |
| 177345 | 96 | 6 | 11 | 960611 | 9.00 | -140.28 | NaN | NaN | NaN | 25.43 | 27.26 |
| 177350 | 96 | 6 | 16 | 960616 | 9.00 | -140.28 | NaN | NaN | NaN | 25.38 | 27.59 |
| 177378 | 96 | 7 | 14 | 960714 | 9.00 | -140.28 | NaN | NaN | NaN | 26.55 | 27.92 |
| 177387 | 96 | 7 | 23 | 960723 | 9.00 | -140.27 | NaN | NaN | NaN | 25.77 | 27.84 |
| 177390 | 96 | 7 | 26 | 960726 | 9.00 | -140.26 | -4.8 | -1.4 | 87.8 | NaN | NaN |
| 177415 | 96 | 8 | 20 | 960820 | 9.00 | -140.27 | NaN | NaN | NaN | 27.17 | 28.49 |
| 177427 | 96 | 9 | 1 | 960901 | 8.99 | -140.27 | -0.5 | 4.9 | 80.1 | NaN | NaN |
| 177433 | 96 | 9 | 7 | 960907 | 9.00 | -140.26 | NaN | NaN | NaN | 27.64 | 28.22 |
| 177438 | 96 | 9 | 12 | 960912 | 9.00 | -140.26 | NaN | NaN | NaN | NaN | NaN |
| 177606 | 97 | 2 | 27 | 970227 | 9.00 | -140.28 | -8.0 | -4.2 | 86.7 | NaN | NaN |
| 177655 | 97 | 4 | 17 | 970417 | 9.00 | -140.00 | NaN | NaN | NaN | NaN | NaN |
| 177656 | 97 | 4 | 18 | 970418 | 8.98 | -140.30 | NaN | NaN | NaN | NaN | NaN |
| 178079 | 98 | 6 | 15 | 980615 | 8.95 | -140.33 | NaN | NaN | NaN | 27.09 | 28.09 |
# Check for unique values to determine which years are present in the dataset
remoteStations['year'].unique()
array([80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96,
97, 98])
# Check for unique values in the latitude and longitude
remoteStations['latitude'].unique()
array([-0.02, 0. , 0.01, -0.04, 0.03, 0.04, 0.06, 0.08, 0.07,
0.1 , 0.09, 0.11, -0.01, -0.03, 0.02, -0.06, -0.05, -0.07,
-0.08, -0.09, 0.05, 0.25, 0.5 , 0.58, 0.59, 0.33, 0.34,
0.37, 0.35, 0.56, 0.85, 0.15, 0.14, 0.13, 0.19, 0.16,
-0.2 , -0.21, -0.19, -0.18, -0.22, -0.23, -0.24, -0.25, -0.17,
-0.26, -0.27, -0.28, -0.29, -0.3 , -0.16, -0.15, -0.14, -0.52,
-0.79, -0.4 , -0.61, -0.71, -0.74, -0.72, -0.83, -0.88, 0.23,
0.27, 0.26, 0.28, 0.29, 0.3 , 0.31, 0.12, 0.22, 0.17,
0.2 , -0.1 , -0.13, -0.11, -0.12, 0.21, 0.38, 0.46, 0.65,
0.68, 1.08, 1.15, -0.31, -0.32, -0.33, -0.35, -0.39, -0.43,
-0.46, -0.51, -0.54, -0.56, -0.64, -0.37, -0.47, -0.5 , -0.57,
0.62, 0.89, 1.06, 0.97, 0.84, 0.45, 0.39, 0.36, 0.24,
0.32, 0.42, 0.47, 0.53, 0.71, 0.76, -0.42, -0.6 , -0.65,
-0.73, -0.8 , -0.76, -0.68, 0.18, 0.41, 0.4 , 0.44, 0.48,
0.49, 0.52, 0.55, 0.57, 0.43, 0.51, 0.6 , 0.61, 0.54,
-0.34, -0.41, -0.44, 1. , 2.15, 2.16, 2.14, 2.13, 2. ,
2.19, 2.2 , 2.18, 2.06, 2.07, 2.05, 2.08, 2.1 , 2.09,
1.99, 2.01, 1.98, 2.11, 2.02, 2.03, 2.04, 2.12, 2.34,
1.93, 1.92, 1.91, 1.94, 1.9 , 1.95, 1.96, 1.86, 1.77,
1.75, 1.69, 1.63, 1.57, 1.53, 1.49, 1.43, 1.37, 1.97,
2.43, 2.44, 2.42, 2.6 , 2.82, 2.45, 2.41, 2.38, 2.33,
2.32, 2.31, 2.35, 2.36, 2.4 , 2.26, 1.88, 1.87, 1.85,
1.83, 1.78, 2.17, 2.23, 2.27, 2.28, 2.25, 2.29, 2.3 ,
2.58, 3.49, 1.82, 1.76, 1.74, 1.73, 2.54, 2.61, 2.68,
2.69, 2.65, 2.64, 2.62, 2.39, 1.64, 1.19, 0.86, 1.2 ,
1.51, 1.8 , 2.75, 2.81, 2.88, 2.22, 2.46, 2.56, 2.66,
2.79, 2.87, 2.89, 2.92, 2.93, 2.98, 1.81, 1.79, 1.71,
1.59, 1.41, 1.46, 1.68, 2.47, 2.78, 2.91, 2.24, 2.52,
2.74, 2.83, 2.53, -2. , -2.03, -1.98, -2.02, -2.04, -2.05,
-2.06, -2.01, -1.99, -2.08, -2.09, -2.07, -2.1 , -2.13, -2.16,
-2.2 , -2.21, -2.17, -2.12, -1.96, -1.94, -1.92, -1.88, -1.86,
-1.87, -1.9 , -1.93, -2.14, -2.19, -2.22, -2.26, -2.27, -2.28,
-2.29, -2.23, -2.25, -2.32, -2.31, -2.33, -2.34, -2.3 , -2.18,
-2.15, -2.11, -2.24, -2.36, -2.38, -2.4 , -2.45, -2.47, -2.51,
-2.55, -2.57, -2.58, -2.53, -2.43, -2.42, -2.37, -2.39, -2.41,
-2.35, -2.44, -1.97, -1.84, -1.82, -1.81, -1.8 , -1.78, -1.77,
-1.83, -1.89, -1.95, -1.91, -2.46, -2.62, -2.72, -2.8 , -2.84,
-2.85, -2.83, -2.79, -2.76, -2.74, -2.7 , -2.68, -2.66, -2.67,
-2.71, -2.69, -2.64, -2.59, -2.56, -2.61, -2.52, -2.49, -1.73,
-1.74, -1.75, -1.85, -1.72, -1.64, -1.57, -1.51, -1.48, -1.47,
-1.46, -1.59, -1.79, -1.67, -1.61, -1.56, -1.44, -1.4 , -1.36,
-1.31, -1.34, -1.37, -1.39, -1.41, -1.42, -1.6 , -1.55, -1.63,
-1.69, -1.76, -1.27, -1.25, -1.26, -1.29, -1.5 , -1.32, -1.58,
-1.65, -2.89, -2.6 , -2.5 , -2.48, -2.54, -2.63, -2.65, 5. ,
5.06, 5.05, 5.04, 5.02, 5.01, 5.03, 5.13, 5.19, 5.18,
4.99, 4.98, 4.96, 4.97, 5.07, 5.08, 5.1 , 5.17, 5.16,
5.09, 5.24, 5.27, 5.3 , 5.34, 5.37, 5.38, 5.4 , 5.41,
5.31, 5.32, 5.33, 5.36, 5.39, 5.12, 5.11, 5.14, 5.2 ,
5.25, 5.26, 5.21, 5.29, 5.28, 5.35, 5.15, 4.95, 4.92,
4.94, 4.93, 5.23, 5.22, 4.9 , 4.89, 4.91, 4.86, 4.88,
4.87, 5.64, 5.91, 5.94, -5.01, -5. , -5.02, -4.99, -4.98,
-5.04, -5.05, -5.03, -5.06, -4.97, -4.96, -4.95, -4.9 , -4.8 ,
-4.77, -4.75, -4.7 , -4.59, -4.51, -4.45, -4.37, -4.18, -4.07,
-4.94, -5.16, -4.49, -5.07, -4.91, -4.92, -4.93, -4.89, -4.87,
-4.85, -4.83, -4.81, -4.79, -4.76, -4.71, -4.72, -4.73, -4.69,
-4.64, -4.63, -4.66, -4.67, -4.68, -4.65, -4.74, -4.78, -4.84,
-4.86, -4.88, -4.47, -4.34, -4.31, -4.27, -4.23, -4.22, -4.29,
-4.35, -4.4 , -4.41, -4.39, 7.03, 7.04, 7.05, 7.02, 7. ,
7.06, 7.07, 7.08, 7.09, 7.1 , 7.01, 7.17, 7.2 , 7.21,
7.23, 7.24, 6.95, 6.96, 6.97, 6.94, 6.98, 6.68, 6.69,
6.67, 6.66, 6.71, 6.65, 6.64, 6.7 , 6.72, 6.73, 6.74,
6.76, 6.75, 6.77, 6.78, 6.8 , 6.79, 6.93, 6.99, 8.03,
8.04, 8.02, 8.05, 8.06, 7.98, 8. , 8.01, 8.07, 7.99,
7.97, 8.13, 8.14, 8.12, 8.1 , 8.23, 8.17, 8.11, 8.15,
8.08, 7.96, 7.94, 7.95, 7.93, 7.92, 8.18, 8.16, 8.28,
8.35, 8.4 , 8.45, 8.43, 8.56, 8.53, 8.59, 8.68, 8.71,
8.73, 8.78, 8.81, 8.86, 8.89, 8.87, 8.93, 8.97, 8.98,
8.09, 7.9 , 7.91, 8.2 , 8.21, 8.19, 8.22, -8.03, -8.04,
-8.02, -8.08, -8.07, -8.09, -8.01, -8. , -7.98, -7.99, -8.05,
-8.06, -8.1 , -7.97, -7.96, -7.95, -7.94, -7.91, -7.9 , -7.87,
-7.8 , -7.69, -7.6 , -7.54, -7.5 , -7.43, -7.38, -7.31, -7.27,
-7.22, -7.15, -7.08, -7.07, -7.03, -8.26, -8.27, -8.24, -8.25,
-8.28, -8.29, -8.3 , -8.31, -8.33, -8.32, -8.16, -8.22, -8.15,
-8.19, -8.23, -8.36, -8.4 , -8.42, -8.43, -8.47, -8.5 , -8.52,
-8.56, -8.6 , -8.69, -8.77, -8.81, 8.99, 9. , 9.03, 8.96,
8.95, 9.01, 9.02, 9.04, 9.05, 8.94, 8.91])
remoteStations['longitude'].unique()
array([-109.46, -109.56, -109.66, ..., 163.18, 164.83, -140.31])
Observations¶
- Missing values are: zon_winds 25,163; mer_winds 25,162; humidity 65,761; air_temp 18,237; ss_temp 17,007.
- The missing values are concentrated by station, some of which appear to be missing the wind data or the temperature data. These rows can be removed from the dataset during the cleaning phase.
- The missing data represents between 10% and 15% of the total dataset for each variable, with the exception of humidity, which has 36% missing values.
- The years present run from 1980 to 1998.
- The geo-positioning will require further analysis: because the buoys drift, each station reports many slightly different coordinate values.
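One way to check where the gaps concentrate is to count the missing values per year. The sketch below is a minimal illustration on a small made-up frame named `demo` (a hypothetical stand-in for `remoteStations`); the values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny illustrative stand-in for remoteStations (values are made up)
demo = pd.DataFrame({
    "year": [80, 80, 80, 96, 96, 96],
    "zon_winds": [-5.8, -4.1, -3.2, np.nan, np.nan, -4.8],
    "humidity": [np.nan, np.nan, np.nan, 80.3, 80.7, 87.8],
})

# Count missing values per column within each year
missing_by_year = demo.drop(columns="year").isna().groupby(demo["year"]).sum()

# Share of missing values per column across the whole frame, in percent
missing_pct = (demo.isna().mean() * 100).round(1)

print(missing_by_year)
print(missing_pct)
```

The same grouping could be applied to a station identifier once one has been derived from the coordinates.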
Action Items:
- Check for unique values in the latitude and longitude fields to see how many remote buoy stations are represented in the dataset.
- Use geopandas to plot the location of the buoys to give a geographic frame of reference to the analysis. Account for the floating drift when analyzing unique geocoordinates.
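The first action item can be sketched by snapping the drifting coordinates to a nominal grid and counting the distinct pairs. This is only a sketch under assumptions: the positions are made up, the names `positions`, `lat_nominal`, `lon_nominal`, and `stations` are hypothetical, and the one-degree grid is a guess that may need to be coarser where a buoy's drift straddles a rounding boundary.

```python
import pandas as pd

# Illustrative drifting positions for two hypothetical buoys (made-up values)
positions = pd.DataFrame({
    "latitude":  [2.00, 1.98, 2.03, 9.00, 9.01, 8.99],
    "longitude": [-110.02, -109.96, -110.05, -140.25, -140.26, -140.27],
})

# Snap each reading to the nearest whole degree so small drifts collapse
# onto one nominal position. The 1-degree grid is an assumption.
positions["lat_nominal"] = positions["latitude"].round(0)
positions["lon_nominal"] = positions["longitude"].round(0)

# Each distinct (lat, lon) pair is treated as one station
stations = (
    positions.groupby(["lat_nominal", "lon_nominal"])
    .size()
    .reset_index(name="n_readings")
)
print(stations)
```

On the full dataset, the number of rows in `stations` would give a rough count of buoys, which can then be compared against the plotted locations.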
Week 3: Data Cleaning, Analysis, Preparation¶
The next steps are to begin the data cleaning. This would include handling any missing values, creating a sample of a larger dataset, and updating data types in Python. This is using the skills from the last three weeks to begin the hard work of cleaning the data to find those key insights.
Data Cleaning:¶
Headers Update and Map New Column¶
- To aid in analysis and visualization, add new columns as needed. For instance, instead of an entire state name, add a column that has a two-letter abbreviation for the state.
- The list of headers requires standardization; these will be updated to ensure uniformity.
- Update ALL CAPS or all lowercase to the appropriate case. These can be updated to Title Case.
# Update date from integer to datetime
remoteStations['date'] = pd.to_datetime(remoteStations['date'], format='%y%m%d')
remoteStations['year'] = pd.to_datetime(remoteStations['year'], format='%y')
remoteStations['month'] = pd.to_datetime(remoteStations['month'], format='%m')
remoteStations['day'] = pd.to_datetime(remoteStations['day'], format='%d')
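If plain integer calendar parts are preferred over separate datetime columns, they can be derived from the parsed date instead. A minimal sketch on a made-up frame named `sample` (hypothetical, not part of the notebook):

```python
import pandas as pd

# Made-up sample of integer YYMMDD dates like those in the date column
sample = pd.DataFrame({"date": [800307, 960316, 980615]})

# Parse via strings so the two-digit year format is applied explicitly
sample["date"] = pd.to_datetime(sample["date"].astype(str), format="%y%m%d")

# Derive plain integer calendar parts from the parsed date
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month
sample["day"] = sample["date"].dt.day
print(sample)
```

With this approach the year, month, and day columns stay numeric, so they can be used directly for grouping or in summary statistics.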
# Drop the rows
# Syntax: DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False)
remoteStations = remoteStations.dropna()
# Check the null count again
remoteStations.isnull().sum()
year         0
month        0
day          0
date         0
latitude     0
longitude    0
zon_winds    0
mer_winds    0
humidity     0
air_temp     0
ss_temp      0
dtype: int64
# Check the .info() again
remoteStations.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
Index: 93935 entries, 4059 to 178078
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   year       93935 non-null  datetime64[ns]
 1   month      93935 non-null  datetime64[ns]
 2   day        93935 non-null  datetime64[ns]
 3   date       93935 non-null  datetime64[ns]
 4   latitude   93935 non-null  float64
 5   longitude  93935 non-null  float64
 6   zon_winds  93935 non-null  float64
 7   mer_winds  93935 non-null  float64
 8   humidity   93935 non-null  float64
 9   air_temp   93935 non-null  float64
 10  ss_temp    93935 non-null  float64
dtypes: datetime64[ns](4), float64(7)
memory usage: 8.6 MB
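Called without arguments, `dropna()` removes every row that has at least one missing value. If an analysis only needs some of the columns, the `subset` parameter keeps more rows. The sketch below shows the difference on a made-up frame; `demo`, `strict`, and `temps_only` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up readings with gaps in different columns
demo = pd.DataFrame({
    "humidity": [np.nan, 80.3, 85.5, np.nan],
    "air_temp": [21.94, 25.76, np.nan, 26.27],
    "ss_temp":  [21.88, 26.50, 26.47, 26.85],
})

# Default: drop a row if ANY column is missing
strict = demo.dropna()

# Alternative: only require the columns needed for a temperature analysis
temps_only = demo.dropna(subset=["air_temp", "ss_temp"])

print(len(demo), len(strict), len(temps_only))
```

Whether the stricter or the looser rule is appropriate depends on which variables the research question actually uses.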
# Let's create a list of the columns in the dataset
remoteStationsCols = remoteStations.columns
remoteStationsCols
Index(['year', 'month', 'day', 'date', 'latitude', 'longitude', 'zon_winds',
'mer_winds', 'humidity', 'air_temp', 'ss_temp'],
dtype='object')
Observations¶
The null values have now been removed, and the headers are left unchanged because they need no further transformation. The dataset is eleven columns wide and 93,935 rows long. The next steps are to confirm the data types and clean any extreme values before feature engineering.
# Let's check the dtypes
remoteStations.info(verbose=True, show_counts=True, memory_usage=True)
<class 'pandas.core.frame.DataFrame'>
Index: 93935 entries, 4059 to 178078
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   year       93935 non-null  datetime64[ns]
 1   month      93935 non-null  datetime64[ns]
 2   day        93935 non-null  datetime64[ns]
 3   date       93935 non-null  datetime64[ns]
 4   latitude   93935 non-null  float64
 5   longitude  93935 non-null  float64
 6   zon_winds  93935 non-null  float64
 7   mer_winds  93935 non-null  float64
 8   humidity   93935 non-null  float64
 9   air_temp   93935 non-null  float64
 10  ss_temp    93935 non-null  float64
dtypes: datetime64[ns](4), float64(7)
memory usage: 8.6 MB
# Quick backup (use .copy() so later changes to remoteStations do not alter the backup)
remoteStationsOG = remoteStations.copy()
Week 4: Data Exploration and Analysis¶
This week we will use statistical methods to learn more about the dataset. We can also create visualizations such as charts and graphs to see how the data statistics perform. These charts can be extracted or saved to be placed in reports or executive summaries.
Data Exploration:¶
After cleaning, the data covers measurements from remote sensors between November 1989 and June 1998, with 93,935 rows and eleven columns. Because the year, month, and day fields were retained alongside date, either the date or the parsed fields need to be dropped before the modeling steps.
Summary Statistics¶
The summary statistics can give insight into the data, showing the maximum, minimum, and the variation in the data from the mean, median, and standard deviation. These are all markers of the quality of the data and can yield insight into how well the data can fit particular types of models.
## describe the descriptive stats
# Syntax: DataFrame.describe()
remoteStations.describe().T
| | count | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|
| year | 93935 | 1995-04-29 06:18:47.886304384 | 1989-11-29 00:00:00 | 1993-10-28 00:00:00 | 1995-04-30 00:00:00 | 1996-12-05 00:00:00 | 1998-06-23 00:00:00 | NaN |
| month | 93935 | 1995-04-29 06:18:47.886304384 | 1989-11-29 00:00:00 | 1993-10-28 00:00:00 | 1995-04-30 00:00:00 | 1996-12-05 00:00:00 | 1998-06-23 00:00:00 | NaN |
| day | 93935 | 1995-04-29 06:18:47.886304384 | 1989-11-29 00:00:00 | 1993-10-28 00:00:00 | 1995-04-30 00:00:00 | 1996-12-05 00:00:00 | 1998-06-23 00:00:00 | NaN |
| date | 93935 | 1995-04-29 06:18:47.886304384 | 1989-11-29 00:00:00 | 1993-10-28 00:00:00 | 1995-04-30 00:00:00 | 1996-12-05 00:00:00 | 1998-06-23 00:00:00 | NaN |
| latitude | 93935.0 | 0.304808 | -8.33 | -2.16 | 0.01 | 4.98 | 9.05 | 4.770791 |
| longitude | 93935.0 | -70.836832 | -180.0 | -155.0 | -125.0 | -94.96 | 170.01 | 128.731985 |
| zon_winds | 93935.0 | -3.352878 | -10.7 | -5.9 | -4.1 | -1.5 | 14.3 | 3.42321 |
| mer_winds | 93935.0 | -0.046458 | -10.6 | -2.1 | -0.1 | 2.0 | 13.0 | 3.021228 |
| humidity | 93935.0 | 81.325627 | 52.1 | 77.7 | 81.3 | 84.8 | 99.9 | 5.275265 |
| air_temp | 93935.0 | 27.062438 | 17.54 | 26.35 | 27.46 | 28.21 | 31.48 | 1.674481 |
| ss_temp | 93935.0 | 27.882108 | 18.19 | 27.05 | 28.37 | 29.22 | 31.04 | 1.871993 |
Observation of Descriptive Statistics¶
- The zonal and meridional winds are vector components: the sign indicates direction and the absolute value indicates the magnitude of the wind speed. The zonal winds range from -10.7 to 14.3 and the meridional winds from -10.6 to 13.0. Their standard deviations are similar (3.42 and 3.02), and for both the quartile 1 and quartile 3 values fall within one standard deviation of the median. The meridional mean and median are close (-0.05 and -0.1), which is consistent with a roughly symmetric distribution; the zonal mean (-3.35) sits above its median (-4.1), which hints at some right skew.
- Humidity has a wider range, with a minimum value of 52.1 and a maximum of 99.9. The standard deviation is small (5.28), Q1 (77.7) and Q3 (84.8) both fall within one standard deviation of the median (81.3), and the mean (81.33) nearly equals the median, so any right-sided skewness in the distribution is slight.
- What are the mean values for the data?
- Are the mean and median values close to each other? If so this could indicate a normal distribution of the data. If not, this could indicate skewness in the data. If the mean is smaller than the median the values are likely skewed left, toward the minimum value. If the mean is larger than the median then there is skewness to the right indicating more high values in the data distribution.
- What are the quartile ranges for the data? What value is the 25th percentile of the data? What value is the 75th percentile of the data?
- How does the standard deviation for the data compare to the mean? High values for the standard deviation indicate a large variation in the data and likely a wide spread of the data across the range from minimum to maximum.
Dataframe Subset:
- Consider the same questions above.
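The questions above about the mean, median, and skewness can be checked numerically. The following is a minimal sketch on a made-up series named `values` (hypothetical, chosen to have a long right tail), not a result from the dataset.

```python
import pandas as pd

# Made-up right-skewed sample: most values are small, one is large
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 20], name="demo")

summary = pd.Series({
    "mean": values.mean(),
    "median": values.median(),
    "std": values.std(),
    "skew": values.skew(),
})
print(summary.round(2))

# A mean above the median together with a positive skew points to a right tail
right_tailed = summary["mean"] > summary["median"] and summary["skew"] > 0
print(right_tailed)
```

Running the same comparison per column on `remoteStations` would show which variables lean left or right before any plots are drawn.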
Observations of Descriptive Statistics¶
The following are some observations about the table:
DataFrame:
- latitude and longitude still vary widely (latitude from -8.33 to 9.05, longitude from -180.0 to 170.01) because the data has not been segmented to one station, so both columns need to stay in the dataset for now.
- The spread of air_temp and ss_temp is narrow (standard deviations of 1.67 and 1.87), and for both the median sits above the mean. The minimum values (17.54 and 18.19) lie much further from the median than the maximum values (31.48 and 31.04), so the low end needs further investigation before conclusions are drawn from these descriptive statistics.
# Check the Max Values
# Find the row(s) with the extreme values
maxValues = remoteStations.max()
maxValues
year         1998-06-23 00:00:00
month        1998-06-23 00:00:00
day          1998-06-23 00:00:00
date         1998-06-23 00:00:00
latitude                    9.05
longitude                 170.01
zon_winds                   14.3
mer_winds                   13.0
humidity                    99.9
air_temp                   31.48
ss_temp                    31.04
dtype: object
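The 26°C record that applies to the Pacific Ocean off the coast of San Diego is not a suitable ceiling for this dataset, which samples the equatorial Pacific. Here the maximum air_temp is 31.48°C and the maximum ss_temp is 31.04°C, and the ss_temp 25th percentile is already 27.05°C, so a 27°C threshold would discard roughly three quarters of the rows. The maxima lie within about three standard deviations of their means and are kept for now.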
# Check the min Values
# Find the row(s) with the extreme values
minValues = remoteStations.min()
minValues
year         1989-11-29 00:00:00
month        1989-11-29 00:00:00
day          1989-11-29 00:00:00
date         1989-11-29 00:00:00
latitude                   -8.33
longitude                 -180.0
zon_winds                  -10.7
mer_winds                  -10.6
humidity                    52.1
air_temp                   17.54
ss_temp                    18.19
dtype: object
The max values are within the functional range of the extremes we see in the Pacific Ocean. The minimum values need a closer look: there are no zero readings, but the lowest air_temp (17.54°C) and ss_temp (18.19°C) sit more than five standard deviations below their means, so those rows need to be inspected before modeling.
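`DataFrame.max()` and `DataFrame.min()` return the extreme value of each column, not the rows that contain them. To see an extreme reading in context, `idxmax()` and `idxmin()` give the index labels, which can then be passed to `.loc`. A minimal sketch on made-up values; `demo`, `max_rows`, and `min_rows` are hypothetical names.

```python
import pandas as pd

# Made-up temperature readings with arbitrary index labels
demo = pd.DataFrame(
    {"air_temp": [26.4, 30.9, 18.1, 27.5],
     "ss_temp":  [27.0, 30.1, 18.6, 30.8]},
    index=[10, 11, 12, 13],
)

# Index label of the row holding each column's maximum and minimum
max_rows = demo.idxmax()
min_rows = demo.idxmin()

# Pull the full rows so each extreme reading can be seen in context
print(demo.loc[max_rows.unique()])
print(demo.loc[min_rows.unique()])
```

Seeing the whole row helps to judge whether an extreme value is an isolated sensor glitch or part of a consistent reading across variables.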
ETL: Additional variables¶
To complete all of the analyses there are additional variables that will need to be mapped to the existing remoteStations dataset.
Action Items: An additional variable for the predictor target is required for storm prediction. To predict hurricanes, the variable needs to be derived from the wind speed. The Saffir-Simpson Hurricane Wind Scale is based on the maximum sustained wind speed, averaged over one minute.
Action Items: Classification as a Hurricane or Tropical Storm. Hurricanes and typhoons are both tropical cyclones; the classes below differ by sustained wind speed.
- Tropical Depression: wind speed is 38 miles per hour or less.
- Tropical Storm: wind speed is between 39 and 73 miles per hour.
- Hurricane, category 1: wind speeds between 74-95 miles per hour.
- Hurricane, category 2: wind speeds between 96-110 miles per hour.
- Hurricane, category 3: wind speeds between 111-129 miles per hour.
- Hurricane, category 4: wind speeds between 130-156 miles per hour.
- Hurricane, category 5: wind speeds of 157 miles per hour or higher.
- Hurricane, category 6: not part of the official scale; a category for wind speeds greater than 192 miles per hour has only been proposed.
- Typhoon: Tropical Cyclone in the Northwest Pacific Ocean. Super Typhoon is a subclass with winds greater than or equal to 150 miles per hour.
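The classification above could be mapped onto the dataset roughly as follows. This is a sketch under stated assumptions: the wind components are assumed to be in metres per second, the values are made up, and the column names `wind_speed_mph` and `storm_class` are hypothetical. The bin edges follow the current Saffir-Simpson thresholds (category 4 from 130 mph, category 5 from 157 mph). Buoy readings are not one-minute sustained winds, so the labels are only a rough proxy, and the lowest bin simply means the speed is below tropical storm strength.

```python
import numpy as np
import pandas as pd

MS_TO_MPH = 2.23694  # metres per second to miles per hour

# Made-up wind components; units assumed to be m/s
demo = pd.DataFrame({
    "zon_winds": [-3.0, 14.3, -20.0, 40.0],
    "mer_winds": [4.0, 13.0, 15.0, 30.0],
})

# Wind speed is the magnitude of the (zonal, meridional) vector
demo["wind_speed_mph"] = np.hypot(demo["zon_winds"], demo["mer_winds"]) * MS_TO_MPH

# Bin edges in mph; right=False makes each bin [lower, upper)
edges = [0, 39, 74, 96, 111, 130, 157, np.inf]
labels = ["Tropical Depression", "Tropical Storm", "Category 1",
          "Category 2", "Category 3", "Category 4", "Category 5"]
demo["storm_class"] = pd.cut(demo["wind_speed_mph"], bins=edges,
                             labels=labels, right=False)
print(demo)
```

Using `pd.cut` keeps the thresholds in one list, so the edges can be adjusted in a single place if a different scale is chosen.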
# Make a copy of the DataFrame before manipulation
# Syntax: DataFrameOG = workingDF.copy()
remoteStationsOG = remoteStations.copy()
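A plain assignment only gives the same DataFrame a second name, while `.copy()` creates an independent backup. The short sketch below shows the difference on a made-up frame; `working`, `alias`, and `backup` are hypothetical names.

```python
import pandas as pd

working = pd.DataFrame({"humidity": [80.3, 85.5]})

alias = working          # same object under a second name
backup = working.copy()  # independent copy of the data

# Modify the working frame in place
working.loc[0, "humidity"] = -1.0

print(alias.loc[0, "humidity"])   # follows the change
print(backup.loc[0, "humidity"])  # keeps the original value
```

Only the copied frame can be relied on to restore the data after a manipulation goes wrong.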
Correlation¶
- Correlation is a statistic that measures the degree to which two variables move in relation to each other. A positive correlation indicates the extent to which those variables increase or decrease in parallel.
- A negative correlation indicates the extent to which one variable increases as the other decreases.
- Correlation among multiple variables can be represented in the form of a matrix. This allows us to see which pairs have the highest correlations.
- Correlation is a mutual relationship or connection between two or more things. It takes a value between +1 and -1.
- One important note: correlation can only be computed between numeric columns, so columns with string values will not be included.
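To make the definition concrete, the Pearson coefficient can be computed by hand and compared with the pandas result. This is a minimal sketch on made-up values; `demo`, `r_manual`, and `r_pandas` are hypothetical names, and Pearson is assumed because it is the default method of `corr`.

```python
import numpy as np
import pandas as pd

# Made-up paired temperature readings
demo = pd.DataFrame({
    "air_temp": [26.0, 26.5, 27.0, 27.5, 28.0],
    "ss_temp":  [26.4, 27.1, 27.3, 28.2, 28.5],
})

x = demo["air_temp"].to_numpy()
y = demo["ss_temp"].to_numpy()

# Pearson r: co-variation of x and y divided by the product of their spreads
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same statistic (Pearson is the default method)
r_pandas = demo["air_temp"].corr(demo["ss_temp"])
print(round(r_manual, 4), round(r_pandas, 4))
```

The two numbers agree, which shows that the matrix produced by `corr` is simply this calculation repeated for every pair of numeric columns.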
# Create correlation matrix
remoteStationsCorr = remoteStations.corr(numeric_only=True)
# Now call the correlation variable to see the correlation matrix.
remoteStationsCorr
| | latitude | longitude | zon_winds | mer_winds | humidity | air_temp | ss_temp |
|---|---|---|---|---|---|---|---|
| latitude | 1.000000 | 0.096651 | 0.117911 | -0.092178 | 0.158111 | 0.076123 | 0.125119 |
| longitude | 0.096651 | 1.000000 | 0.364256 | -0.024335 | -0.042777 | 0.249050 | 0.304027 |
| zon_winds | 0.117911 | 0.364256 | 1.000000 | 0.079763 | 0.063553 | 0.233156 | 0.376015 |
| mer_winds | -0.092178 | -0.024335 | 0.079763 | 1.000000 | 0.077647 | -0.339254 | -0.284897 |
| humidity | 0.158111 | -0.042777 | 0.063553 | 0.077647 | 1.000000 | -0.388059 | -0.324348 |
| air_temp | 0.076123 | 0.249050 | 0.233156 | -0.339254 | -0.388059 | 1.000000 | 0.940233 |
| ss_temp | 0.125119 | 0.304027 | 0.376015 | -0.284897 | -0.324348 | 0.940233 | 1.000000 |
Observations of the Correlation Matrix¶
Correlation matrices can be viewed in a visualization or a visual table that shows the relative relationship between the variables using color while stating their values. We will use a color map (cmap) with a high contrast to see those that correlate by color. Remember that a correlation matrix is a square that is a mirror image across the diagonal. This means the bottom half of the matrix looks exactly like the top half. To reduce the number of values to view, let's use np.triu to build a mask that hides the upper half and the diagonal, so that only the lower half of the correlation matrix is shown.
# Set seaborn themes
sns.set_theme(style='white')
sns.color_palette('viridis', as_cmap=True)
# To get a correlation matrix
# Plotting the heat map
# Create the plot
plt.figure(figsize=(6,4))
matrix = remoteStationsCorr
# Mask the upper triangle (diagonal included) so each pair is shown once
mask = np.triu(np.ones_like(matrix, dtype=bool))
sns.heatmap(remoteStationsCorr,
annot=True,
linewidths=0.7,
cmap='viridis',
fmt= '.2f',
mask=mask)
# specify name of the plot
plt.title('Correlation Between Features')
plt.show()
Observations¶
Are all the values the same color? This is called multicollinearity and indicates there are multiple independent variables that each have a strong relationship with each other. For instance, if you are examining crime data, categories such as robbery may also correlate with vehicular theft when the assailant was charged with both crimes. While they are independent crimes, they often occur together, indicating a relationship. Multicollinear relationships complicate feature engineering for machine learning models and may need to have their dimensionality reduced (dropping columns or further subsets) to make sure the model trains well for those specific variables.
Are specific variables correlated higher than others?
Are there negative correlations indicating an inverse relationship in the variables? This indicates that as one variable is increasing, the other variable is decreasing. Negative correlations can be high (close to -1) or low (close to 0).
Remember that correlation does not equal causation. Be careful with your wording when establishing relationships between the variables.
Are there variables that lack correlation to any other variable? These are variables that may not be needed in the analysis and can be used to reduce the dimensionality of the data.
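One way to answer these questions systematically is to list each pair of variables once and rank the pairs by the absolute size of their correlation. The sketch below uses three of the variables, with values copied from the correlation matrix above; `corr`, `upper`, `pairs`, and `ranked` are hypothetical names.

```python
import numpy as np
import pandas as pd

# A few values copied from the correlation matrix above
cols = ["humidity", "air_temp", "ss_temp"]
corr = pd.DataFrame(
    [[1.000000, -0.388059, -0.324348],
     [-0.388059, 1.000000, 0.940233],
     [-0.324348, 0.940233, 1.000000]],
    index=cols, columns=cols,
)

# Keep only the cells above the diagonal so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) pairs, drop the masked cells, sort by strength
pairs = upper.stack().dropna()
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

In this subset the air_temp and ss_temp pair ranks first at 0.94, while both humidity pairs are negative and much weaker.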
Additional Statistical methods are possible. Python is a mathematical programming language and can perform inferential statistics, hypothesis testing, probability distributions, and multivariate statistical analysis.
Visualizations¶
Create Visualizations to aid in the interpretation of the data and answering of the research problem. Using Python plotting libraries seaborn, matplotlib, plotly, or bokeh multiple plots will be completed to see trends and insights in the data.
- Use charts, graphs, maps, and other plots to answer questions related to your research question.
- Bivariate analysis is the process of examining two variables to visualize their relationship. Choose variables from the correlation matrix to see how they affect each other.
- Be sure to use the correct chart for the type of information you need.
# Set Plotly themes
# Javascript Interactive Visualization (computationally expensive)
import plotly.express as px
import plotly.io as pio
pio.templates.default = 'plotly_white'
# Generate a histogram and boxplot for each variable
# Add a calculation for skewness and round to two decimal values
# Add exception coding to skip columns where dtype=object
def analyze_column(df, col):
    try:
        if df[col].dtype == object:
            # Skip categorical columns
            raise TypeError("Skipping column with dtype object")
        print(col)
        # Add skewness calculation and round to two decimals
        print('Skew :', round(df[col].skew(), 2))
        plt.figure(figsize = (15, 4))
        plt.subplot(1, 2, 1)
        df[col].hist(bins=10, grid=False)
        plt.ylabel('count')
        plt.subplot(1, 2, 2)
        sns.boxplot(x = df[col])
        plt.show()
    except TypeError as e:
        print(e)
# Run the function to generate the plots
for col in remoteStations.columns:
    analyze_column(remoteStations.copy(), col)
year
'DatetimeArray' with dtype datetime64[ns] does not support reduction 'skew'
month
'DatetimeArray' with dtype datetime64[ns] does not support reduction 'skew'
day
'DatetimeArray' with dtype datetime64[ns] does not support reduction 'skew'
date
'DatetimeArray' with dtype datetime64[ns] does not support reduction 'skew'
latitude Skew : -0.0
longitude Skew : 1.15
zon_winds Skew : 0.98
mer_winds Skew : 0.03
humidity Skew : 0.1
air_temp Skew : -1.47
ss_temp Skew : -1.46
Observations¶
mpg demonstrates right sided skewness with higher counts at 15 to 25. The distribution demonstrates visual positive skewness that corresponds to the positive skew value of 0.46. The boxplot shows values concentrated between 17 and 30 with long whiskers to a minimum value less than 10 and maximum value beyond the uppermost whisker. There are clear outliers in the mpg values.
cylinders countplot does not have a clear distribution. Values range from 3 to 8 with a predominance of values at four cylinders. Significant values exist for six to eight cylinders. The boxplot supports this with values from 4 to 8. Low whiskers trend down to a value of three.
displacement This data demonstrates strong right skewness, consistent with the skewness value of 0.72. The predominance of values are between 75 and 150, then peak again at 250 before trailing down to fewer than 20 values greater than 450. The boxplot confirms the range of 100 to 250. Whiskers show minimum values greater than 50 but less than 100. The maximum value is approximately 450. This data point is more than two standard deviations above the mean, visualized as a gray line near 150 in the boxplot.
horsepower is a highly skewed variable with values trending to right skewness. This is consistent with the skew value of 1.1. The majority of values are between 75 and 125, as confirmed by the boxplot. Values are as low as 50 and up to 200. Outliers are visible greater than 200 for nine values.
weight shows a near normal distribution with the predominance of values between 12.5 to 17.5 with minimal skewness. This is supported by the skew value of 0.28. The boxplot shows whiskers for minimum values near 10 and maximum near 22.5. Outlier values can be seen both above the max and below the minimum.
year and origin are without visible distribution. The year ranges from 1973 to 1979 with values starting near 1970 and ending near 1982.
remoteStationsCols
Index(['year', 'month', 'day', 'date', 'latitude', 'longitude', 'zon_winds',
'mer_winds', 'humidity', 'air_temp', 'ss_temp'],
dtype='object')
# Let's view a pairplot to visualize multiple graphs at once.
# Include the KDE, search for linear relationships, confirm with the correlation matrix.
sns.pairplot(remoteStations[['zon_winds',
                             'mer_winds',
                             'humidity',
                             'air_temp',
                             'ss_temp']],
             diag_kind="kde",
             corner=True,
             palette="husl")
<seaborn.axisgrid.PairGrid at 0x79d7835f5a20>
Observations
There is one outlier point at -1 for conductivity that is skewing the linear projection. Otherwise the linear projection appears to visually fit the data in the direction and concentration of the data points.
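The pairplot comments suggest confirming visual linear relationships with the correlation matrix. A minimal sketch of that check, using synthetic data in place of `remoteStations` (the column values and their relationships here are assumptions made for illustration, not results from the TAO data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: ss_temp is built to track air_temp, humidity is independent
rng = np.random.default_rng(1)
air = rng.normal(27.0, 1.5, size=500)
demo = pd.DataFrame({
    "air_temp": air,
    "ss_temp": air + rng.normal(0.5, 0.3, size=500),
    "humidity": rng.normal(80.0, 5.0, size=500),
})

# Pearson correlation between every pair of numeric columns
corr = demo.corr(method="pearson").round(2)
print(corr)
```

Pairs with a coefficient near 1 or -1 are the ones worth plotting with a fitted line; pairs near 0 show no linear relationship even if the scatter looks structured.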
Visualizations¶
Create visualizations to aid in interpreting the data and answering the research problem. Using the Python plotting libraries seaborn, matplotlib, plotly, or bokeh, complete multiple plots to see trends and insights in the data.
- Use charts, graphs, maps, and other plots to answer questions related to your research question.
- Bivariate analysis is the process of examining two variables to visualize their relationship. Choose variables from the correlation matrix to see how they affect each other.
- Be sure to use the correct chart for the type of information you need.
# Let's plot a linear regression model of the variables with a visual linear relationship
f, ax = plt.subplots(figsize=(15, 4))
line1 = sns.regplot(x=remoteStations['air_temp'],
                    y=remoteStations['ss_temp'],
                    data=remoteStations,
                    ax=ax,
                    label="Air Temperature vs. Sea Surface Temperature",
                    color='tab:blue')
line1.set_title("Air Temperature vs. Sea Surface Temperature")
line1.set_xlabel("Air Temperature")
line1.set_ylabel("Sea Surface Temperature")
plt.legend()
plt.show()
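`sns.regplot` draws the fitted line but does not report its coefficients. A minimal sketch of how the slope, intercept, and R² could be obtained with `numpy.polyfit`, assuming an ordinary least-squares line is what is wanted. The arrays below are synthetic; the slope of 0.9 and the noise level are invented for illustration and say nothing about the real data.

```python
import numpy as np

# Synthetic stand-in for the air_temp and ss_temp columns
rng = np.random.default_rng(2)
air_temp = rng.uniform(20.0, 31.0, size=300)
ss_temp = 0.9 * air_temp + 3.0 + rng.normal(0.0, 0.4, size=300)

# Degree-1 least-squares fit returns [slope, intercept]
slope, intercept = np.polyfit(air_temp, ss_temp, deg=1)

# R^2 = 1 - (residual sum of squares / total sum of squares)
predicted = slope * air_temp + intercept
ss_res = np.sum((ss_temp - predicted) ** 2)
ss_tot = np.sum((ss_temp - ss_temp.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```

On the real columns, rows with missing values would need to be dropped first, since `np.polyfit` does not ignore NaN.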
Observations¶
# Group data by air_temp and calculate the mean ss_temp for each air_temp value
averageair_temp = remoteStations.groupby('air_temp')['ss_temp'].mean()
# Keep the 50 air_temp values with the highest mean ss_temp
maxair_temp = averageair_temp.sort_values(ascending=False).head(50)
# Create a scatter plot
plt.figure(figsize=(12, 6))
plt.scatter(maxair_temp.index, maxair_temp.values)
# Describe the axes
plt.xlabel('air_temp')
plt.ylabel('Mean Sea Surface Temp')
# Define the title
plt.title('Air Temp Values with Highest Mean Sea Surface Temp')
# Rotate the x-axis labels
plt.xticks(rotation=90)
# Complete the plot
plt.grid(axis='y')
plt.tight_layout()
# Show the visualization
plt.show()
Observations: Scatter Plot¶
Each visualization should have a summary of the findings.
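Grouping directly on `air_temp` creates one group per distinct temperature reading, which can leave very few rows in each group. One possible alternative, sketched here on synthetic data, is to bin the temperatures with `pandas.cut` before averaging. The 1-degree bin width and the `air_temp_bin` column name are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two temperature columns
rng = np.random.default_rng(3)
air = rng.uniform(22.0, 30.0, size=1000)
demo = pd.DataFrame({
    "air_temp": air,
    "ss_temp": air + rng.normal(0.6, 0.3, size=1000),
})

# Bin edges at 22, 23, ..., 30 give eight 1-degree bins
bins = np.arange(22.0, 31.0, 1.0)
demo["air_temp_bin"] = pd.cut(demo["air_temp"], bins=bins, include_lowest=True)

# Mean ss_temp and row count per bin; the count shows how well each mean is supported
binned = demo.groupby("air_temp_bin", observed=True)["ss_temp"].agg(["mean", "count"])
print(binned)
```

Reporting the count next to the mean makes it easy to spot bins whose average rests on only a handful of observations.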
# Set Plotly themes
# Javascript Interactive Visualization (computationally expensive)
import plotly.express as px
import plotly.io as pio
pio.templates.default = 'plotly_white'
# Create an averages table
variable_avg = remoteStations.groupby('air_temp').mean().sort_values(by='ss_temp', ascending=False)
variable_avg
# Create visualization: Bar chart
fig = px.bar(variable_avg,
             x="ss_temp",
             y=variable_avg.index,
             color=variable_avg.index,
             orientation='h',
             height=1000,
             title='Changes in Air Temp by Sea Surface Temp',
             color_continuous_scale=px.colors.sequential.Viridis)
# View the plot
fig.show()
Observations: Bar Chart¶
Each visualization should have a summary of the findings.
Research Question: What is the interaction between the Zonal and Meridional Winds with Increasing Humidity?¶
Winds influence a climate phenomenon called ENSO (El Niño Southern Oscillation). The zonal and meridional winds in the atmosphere can affect ENSO's characteristics. Stronger winds, both zonal (east-west) and meridional (north-south), can reduce the intensity of ENSO, particularly in the eastern Pacific Ocean. Additionally, strong winds can prevent the north-south movement of the ITCZ (Intertropical Convergence Zone), the band near the equator where the trade winds converge. The influence of zonal winds is shown to be more significant than that of meridional winds.
# Create visualization: Plotly Scatter chart
fig = px.scatter(remoteStations,
                 x='zon_winds',
                 y='mer_winds',
                 color='humidity',
                 title='Zonal Winds versus Meridional Wind direction by Humidity',
                 color_continuous_scale=px.colors.sequential.Viridis)
fig.show()
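Because the zonal and meridional values are the east-west and north-south components of one wind vector, they can be combined into a single speed and direction. A minimal sketch, assuming positive `zon_winds` means wind blowing toward the east and positive `mer_winds` means wind blowing toward the north (this sign convention should be checked against the dataset documentation). The sample values and the `wind_speed` and `toward_deg` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wind components in m/s
demo = pd.DataFrame({
    "zon_winds": [3.0, -4.0, 0.0],
    "mer_winds": [4.0, 3.0, -2.0],
})

# Speed is the length of the (zonal, meridional) vector
demo["wind_speed"] = np.hypot(demo["zon_winds"], demo["mer_winds"])

# Direction the wind blows toward, in degrees clockwise from north (0 to 360)
demo["toward_deg"] = np.degrees(np.arctan2(demo["zon_winds"], demo["mer_winds"])) % 360.0

print(demo)
```

Plotting humidity or sea surface temperature against `wind_speed` is one way to test the "stronger winds" statements above with a single variable instead of two components.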
What is the interaction between the Zonal and Meridional winds with Sea Surface Temperature and increasing Humidity?¶
Strong winds weaken certain positive feedback mechanisms that are crucial for ENSO's development. These feedback loops involve temperature variations in the ocean and atmospheric circulation patterns. This suggests that the observed strengthening of zonal and meridional winds over the past few decades might be a contributing factor to the shift in El Niño's characteristics observed after the year 2000.
# Create visualization: Plotly Scatter chart
fig = px.scatter(remoteStations,
                 x='zon_winds',
                 y='ss_temp',
                 color='humidity',
                 title='Zonal Winds versus Sea Surface Temp by Humidity',
                 color_continuous_scale=px.colors.sequential.Viridis)
fig.show()
# Create visualization: Plotly Scatter chart
fig = px.scatter(remoteStations,
                 x='mer_winds',
                 y='ss_temp',
                 color='humidity',
                 title='Meridional Wind direction versus Sea Surface Temp by Humidity',
                 color_continuous_scale=px.colors.sequential.Viridis)
fig.show()
Observations¶
Each visualization should have a summary of the findings.
Research Question: How does the air temperature change across each month in the year?¶
Strengthening of either wind component can reduce ENSO amplitude [1]. The easterly winds and thermocline depth are inversely related [1]. Easterly winds weaken the thermocline feedback, which reduces El Niño events [1]. However, the cold temperature anomalies induced by the easterly winds are confined to the eastern Pacific Ocean [1]. There is also a discussion about the positive wind-evaporation-SST (WES) feedback, which is stronger south of the equator [1]. There is a complex relationship between air temperature and winds in the Pacific Ocean, but certain wind patterns can cool the eastern Pacific Ocean [1].
# Create visualization: Plotly Line chart
# Title describes what is plotted: air_temp on the y-axis, year on the x-axis, one color per month
fig = px.line(remoteStations,
              x="year",
              y="air_temp",
              color="month",
              title='Air Temp by Year, Colored by Month')
fig.show()
Observations¶
Easterly winds can weaken the thermocline feedback, which reduces El Niño events and cools the eastern Pacific Ocean [1]. The positive wind-evaporation-SST (WES) feedback, which is stronger south of the equator, also plays a role in the relationship between air temperature and winds [1]. From this visualization, we observe the following:
- Air temperature is cyclical across the months, with a clear seasonality. This suggests the data could be modeled using SARIMA (Seasonal Autoregressive Integrated Moving Average), a versatile and widely used time series forecasting model, to predict future air temperatures and possible changes to the El Niño weather pattern.
- The spread of the daily air temperature increases over the years. A predictive model could determine the maximum increases in air temperature and the relative effect on the El Niño weather patterns. Will increases in air temperature increase the intensity of the El Niño weather patterns?
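Both observations can be checked numerically before committing to a seasonal model. The sketch below is not a SARIMA fit; it is a simpler first step that computes a monthly climatology (mean per calendar month), the anomalies left after removing it, and the standard deviation per year. The series is synthetic, with an invented seasonal cycle and noise level, and stands in for daily `air_temp` indexed by date.

```python
import numpy as np
import pandas as pd

# Synthetic daily air temperature with an annual cycle plus noise
rng = np.random.default_rng(4)
dates = pd.date_range("1990-01-01", "1997-12-31", freq="D")
seasonal = 1.5 * np.sin(2 * np.pi * dates.dayofyear.to_numpy() / 365.25)
air_temp = pd.Series(27.0 + seasonal + rng.normal(0.0, 0.5, size=len(dates)),
                     index=dates)

# Monthly climatology: one mean value per calendar month (1 to 12)
climatology = air_temp.groupby(air_temp.index.month).mean()

# Anomaly: each day minus the mean of its calendar month
anomaly = air_temp - air_temp.groupby(air_temp.index.month).transform("mean")

# Spread per year, to check whether variability grows over time
yearly_spread = air_temp.groupby(air_temp.index.year).std()

print(climatology.round(2))
print(yearly_spread.round(2))
```

A climatology that varies strongly across months supports the seasonality claim, and a rising `yearly_spread` would support the claim about increasing spread. If both hold on the real data, a seasonal model such as the `SARIMAX` class in statsmodels would be a reasonable next step.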